Cross-Selling of Credit Cards¶

  • Author: Nils Fahrni
  • Date: 14. June 2024

Table of contents

  • Cross-Selling of Credit Cards
    • Data Overview
    • Data Preprocessing
      • Data Loading
        • Account
        • Client
        • Disposition
        • Permanent Order
        • Transaction
        • Loan
        • Credit Card
        • District
      • Data Merging
      • Data Cleaning
        • Removing Junior Cards
    • Model Construction
      • Processing Transactional Data
      • Defining Roll-Up Windows of Transactions
      • Validating Negative Examples
        • Pivoting the transactions
      • Bringing the data together
    • Exploratory Data Analysis
      • Exploring Missing Values
      • Cardholders vs Non-Cardholders
      • Card Types
      • Distribution of Issuance Dates
      • Balance and Volume over time
      • Volume before card purchase
      • Age distribution
    • Feature Engineering
      • Demographic-Based Features
        • Defining Age Groups
        • Age at loan
        • Account age
      • Financial Features
        • Average Monthly Payment Ratio
      • District Features
        • Client-Account Region Match
        • District Population Density
      • Event-Based Statistics
      • Looking at the effect of the new features
        • Impact of variance
        • Looking at temporal client traits
    • Train-Test-Split
    • Baseline Model
    • Further Model Candidates
      • Logistic Regression
      • Decision Tree
      • Random Forest
      • Bagging
      • Boosting
    • Comparing the Model Candidates
      • ROC Curves
      • Confusion Matrices
      • Other Metrics
      • Top Customer Lists
      • Choosing a Final Model
    • Model Reduction
      • Selecting Features
    • Model Explanation
      • Evaluation Metrics
      • Variable Importance
      • Partial Dependence
      • Variable Attribution
    • Conclusive Value of the Final Model
In [ ]:
import warnings
import sys
import os

import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import plotly.graph_objects as go
import plotly.io as pio
import seaborn as sns

plt.style.use('ggplot')

pio.templates.default = 'ggplot2'
pio.renderers.default = "notebook"
pd.set_option('display.max_columns', None)

current_dir = os.getcwd()
parent_dir = os.path.dirname(current_dir)

np.random.seed(1337)

sys.path.append(parent_dir)

Data Overview¶

Entity Relationship Diagram

Data Preprocessing¶

In the following section, the goal is to load all artifacts of the PKDD'99 dataset and merge them into a single dataframe that allows for further processing steps.

Data Loading¶

To load the PKDD'99 dataset, a DataLoader class was created to streamline the loading process. The dataset has a few structural flaws, such as dates being represented as YYMMDD, i.e. as a single integer. The DataLoader object takes care of this flaw and imports all date columns as proper datetime types. This will later allow for simpler date-specific operations and interpretation.

Another task the DataLoader takes on is translation. The PKDD'99 dataset stems from a Czech bank, which means that many column names and categorical values are in Czech. To translate these columns and their values into English, a translation_mappings.json file was created inside the /data directory. This JSON file controls which values and attributes get translated and what their new names are. The file can be passed into the DataLoader, resulting in translated names across all variables.
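The internals of the DataLoader are not shown here, but its two responsibilities can be sketched as follows. This is an illustrative toy example, not the actual implementation; the column names and mapping entries mirror the account table but are stand-ins:

```python
import pandas as pd

# Hypothetical sketch of the DataLoader's two tasks: parsing YYMMDD integer
# dates and applying a translation mapping like translation_mappings.json.
raw = pd.DataFrame({
    'date': [950324, 970112],
    'frequency': ['POPLATEK MESICNE', 'POPLATEK TYDNE'],
})

# 1. YYMMDD integers -> datetime: zero-pad to six digits, parse with %y%m%d
raw['date'] = pd.to_datetime(raw['date'].astype(str).str.zfill(6), format='%y%m%d')

# 2. Czech categorical values -> English, driven by a mapping dictionary
mapping = {'POPLATEK MESICNE': 'MONTHLY CHARGES', 'POPLATEK TYDNE': 'WEEKLY CHARGES'}
raw['frequency'] = raw['frequency'].map(mapping)

print(raw)
```

The zero-padding guards against dates whose two-digit year starts with 0 and would otherwise lose a leading digit when stored as an integer.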

In [ ]:
from src.data_utils import DataLoader

data_loader = DataLoader(base_path='data', translations_name='translation_mappings.json')
data_loader.list_datasets()
Out[ ]:
Dataset Number of Rows
0 loan 682
1 client 5369
2 district 77
3 trans 1056320
4 account 4500
5 card 892
6 order 6471
7 disp 5369

To make sure everything gets loaded correctly, the instantiated data_loader object offers a list_datasets() method which yields a glimpse of the .csv files inside the /data directory.

Account¶

First, the accounts get loaded. As can be seen in the Entity Relationship Diagram, this table is the centerpiece of all relations inside the dataset. As already touched upon, the date variables are represented as 6-digit integers inside the .csv files. This gets addressed during the loading process: a date parsing pattern is passed into the function so that the parts of the integer can be interpreted and converted into the corresponding time components (year, month and day).

In [ ]:
account = data_loader.load_csv('account', parse_dates={'date': '%y%m%d'})
account.sample(5)
Mapped frequency:
{
    "POPLATEK MESICNE": "MONTHLY CHARGES",
    "POPLATEK TYDNE": "WEEKLY CHARGES",
    "POPLATEK PO OBRATU": "TRANSACTION CHARGES"
}
Out[ ]:
account_id district_id frequency date
2103 3646 63 WEEKLY CHARGES 1995-10-20
1721 981 66 MONTHLY CHARGES 1995-03-25
1193 2501 30 MONTHLY CHARGES 1994-02-26
2892 2604 68 MONTHLY CHARGES 1996-07-04
252 1189 59 MONTHLY CHARGES 1993-03-19

The sample shows that the translation mappings were applied correctly: the frequency column is now in English.

Client¶

The next table is the client .csv. It also has multiple relations: one to the Dispositions and one to the Districts.

The client table has a special quirk: the birth_number variable encodes the sex of each client. Male birth numbers are plain date integers, as already seen in the first section. Female birth numbers, however, have 50 added to the month part of the date integer. Since sex/gender could be of value later on, the birth numbers get "fixed" by turning both variants into regular datetime values. Additionally, the gender is written into a separate column based on whether the month part of the date integer is greater than 50. The birthdate also allows us to directly derive the age of the clients in years, so an age column is added by subtracting the birthdate from a proposed "key date" of 1999-12-31. In a real-world scenario this date would of course be the current date and not an arbitrary past date.

In [ ]:
client = data_loader.load_csv('client')

client = client.assign(gender=client['birth_number'].apply(lambda x: 'FEMALE' if int(str(x)[2:4]) > 50 else 'MALE'))

client = client.assign(birth_number=client.apply(lambda x: x['birth_number'] - 5000 if x['gender'] == 'FEMALE' else x['birth_number'], axis=1))

client['birth_number'] = pd.to_datetime(client['birth_number'], format='%y%m%d')
client['birth_date'] = client['birth_number'].apply(lambda x: x - pd.DateOffset(years=100) if x.year > 1999 else x)
client.drop('birth_number', axis=1, inplace=True)

client['age'] = (pd.to_datetime('1999-12-31') - client['birth_date']).dt.days // 365

client.sample(5)
Out[ ]:
client_id district_id gender birth_date age
4105 4336 19 FEMALE 1969-08-02 30
2112 2229 68 MALE 1944-09-07 55
1463 1547 28 FEMALE 1927-12-20 72
3951 4177 34 MALE 1935-12-25 64
3835 4052 74 FEMALE 1950-10-17 49

The sample shows that the dates were formatted correctly and our two new columns (gender and age) were added as intended. No translations were needed within this table, hence there is no translation-mapping output.

Disposition¶

The disposition table holds information on which clients have the "right" to issue permanent orders and ask for a loan. The owners of each account are marked as "OWNER"; other users' relations to an account are marked as "DISPONENT".

It is assumed that only account owners have the right to make a permanent order or purchase a new card, hence the decision was made to remove non-owners (disponents). Disponents are secondary users that have been authorized to use an account: they may be allowed to execute transactions on that account, but they are not its owners.

In [ ]:
disp = data_loader.load_csv('disp')

disp = disp[disp['type'] != 'DISPONENT']

disp.drop('type', axis=1, inplace=True)

disp.sample(5)
Out[ ]:
disp_id client_id account_id
4478 4737 4737 3928
5044 9512 9820 7944
3781 3994 3994 3302
1450 1531 1531 1269
1439 1520 1520 1259

Permanent Order¶

A permanent order record describes an order made between two accounts: the account_id (sender) sends amount to account_to (beneficiary) at bank bank_to. Each order can also be marked with a k_symbol (type of payment).

In [ ]:
order = data_loader.load_csv('order')

order.sample(5)
Mapped k_symbol:
{
    "POJISTNE": "INSURANCE PAYMENT",
    "SIPO": "HOUSEHOLD",
    "LEASING": "LEASING",
    "UVER": "LOAN PAYMENT"
}
Out[ ]:
order_id account_id bank_to account_to amount k_symbol
5126 35080 3854 QR 36950737 1290.0 HOUSEHOLD
2680 32349 2016 ST 81232386 157.0 INSURANCE PAYMENT
403 29843 291 YZ 27026367 12925.0 NaN
5868 40382 7437 EF 2692229 6901.0 HOUSEHOLD
5952 41067 7861 OP 58806037 400.0 NaN

Here, the sample shows that the k_symbol column got translated correctly.

It seems that not every order has a labeled type of payment. To validate this observation we will take a closer look at orders with an undefined k_symbol.

In [ ]:
display(order.isnull().sum())

display(order[order['k_symbol'].isnull()].sample(5))
order_id         0
account_id       0
bank_to          0
account_to       0
amount           0
k_symbol      1379
dtype: int64
order_id account_id bank_to account_to amount k_symbol
2746 32417 2057 YZ 1773274 970.0 NaN
69 29475 48 ST 2071685 1029.0 NaN
6420 45778 11013 IJ 43103534 9499.0 NaN
1788 31375 1342 ST 45400208 1011.0 NaN
2847 32533 2139 GH 92438989 4737.0 NaN

The other variables do not show any additional abnormalities, so it can be concluded that k_symbol is not a mandatory field for every order.

To preserve the information from these undefined orders, a "MISSING" label is added.

The orders contain granular information on which account forwarded money to which bank/recipient. To build a client summary later on, these orders need to be compressed into valuable information that can be assigned to a client. Therefore, the k_symbol is pivoted, and for each order a count is added to the respective order type.

In [ ]:
order['k_symbol'] = order['k_symbol'].fillna('MISSING')

order_pivot = order.pivot_table(index='account_id', columns='k_symbol', values='amount', aggfunc='count', fill_value=0)

order_pivot.sample(5)
Out[ ]:
k_symbol HOUSEHOLD INSURANCE PAYMENT LEASING LOAN PAYMENT MISSING
account_id
552 1 0 0 0 0
198 2 0 0 0 0
578 1 1 0 0 2
1583 0 0 0 1 0
226 1 0 0 1 0

Transaction¶

The transactions seem to be fairly similar to the orders. They also get categorized by a k_symbol attribute. This category's values should therefore also be translated.

In [ ]:
transaction = data_loader.load_csv('trans', parse_dates={'date': '%y%m%d'})

transaction.sample(5)
Mapped type:
{
    "PRIJEM": "CREDIT",
    "VYDAJ": "WITHDRAWAL"
}
Mapped operation:
{
    "VYBER KARTOU": "CREDIT CARD WITHDRAWAL",
    "VKLAD": "CREDIT IN CASH",
    "PREVOD Z UCTU": "COLLECTION FROM ANOTHER BANK",
    "VYBER": "WITHDRAWAL IN CASH",
    "PREVOD NA UCET": "REMITTANCE TO ANOTHER BANK"
}
Mapped k_symbol:
{
    "POJISTNE": "INSURANCE PAYMENT",
    "SLUZBY": "PAYMENT FOR STATEMENT",
    "UROK": "INTEREST CREDITED",
    "SANKC. UROK": "SANCTION INTEREST IF NEGATIVE BALANCE",
    "SIPO": "HOUSEHOLD",
    "DUCHOD": "OLD-AGE PENSION",
    "UVER": "LOAN PAYMENT"
}
Out[ ]:
trans_id account_id date type operation amount balance k_symbol bank account
926595 3444421 3223 1998-07-31 CREDIT NaN 92.9 14640.3 INTEREST CREDITED NaN NaN
715788 134286 450 1997-12-12 CREDIT COLLECTION FROM ANOTHER BANK 4393.0 16076.3 OLD-AGE PENSION AB 94329887.0
673209 3564940 1046 1997-10-31 CREDIT NaN 130.1 27631.5 INTEREST CREDITED NaN NaN
776368 204567 694 1998-01-31 WITHDRAWAL WITHDRAWAL IN CASH 14.6 29151.1 PAYMENT FOR STATEMENT NaN NaN
1028116 3600996 2141 1998-11-30 CREDIT NaN 283.4 70926.4 INTEREST CREDITED NaN NaN

Loan¶

The loan table holds records of the loans given out to certain accounts. It contains the loan amount, its duration in months, the monthly payments and a status.

The status is a categorical variable which gets mapped to some clearer status descriptions.

In [ ]:
loan = data_loader.load_csv('loan', parse_dates={'date': '%y%m%d'})

loan.sample(5)
Mapped status:
{
    "A": "contract finished, no problems",
    "B": "contract finished, loan not payed",
    "C": "running contract, OK so far",
    "D": "running contract, client in debt"
}
Out[ ]:
loan_id account_id date amount duration payments status
629 5015 226 1998-07-12 109344 12 9112.0 running contract, OK so far
323 6818 9030 1996-12-12 155616 48 3242.0 running contract, OK so far
670 6168 5698 1998-11-21 99216 36 2756.0 running contract, OK so far
407 6278 6265 1997-06-10 87360 24 3640.0 running contract, OK so far
27 5189 1166 1994-02-07 149040 48 3105.0 contract finished, no problems

According to the Entity Relationship Diagram, an account can in theory have multiple loans. That would complicate preprocessing, since a loan could then not be matched 1:1 into a summarizing table. Therefore, a first check is whether there are multiple loans for any account:

In [ ]:
print(f'Are there accounts with multiple loans: {loan["account_id"].nunique() < loan.shape[0]}')
Are there accounts with multiple loans: False

The check shows that every loan belongs to a unique account, so no account has more than one loan.
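An equivalent check can be expressed with pandas' duplicated(), which would also list the offending rows if duplicates existed. A minimal sketch on toy data (not the real loan table):

```python
import pandas as pd

# Toy stand-in for the loan table: one row per loan.
loan_toy = pd.DataFrame({
    'loan_id': [5015, 6818, 6168],
    'account_id': [226, 9030, 5698],
})

# keep=False marks every row of a duplicated account_id, so the result
# directly shows which accounts hold more than one loan.
duplicates = loan_toy.loc[loan_toy['account_id'].duplicated(keep=False)]
print(f'Accounts with multiple loans: {len(duplicates)}')  # 0
```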

Credit Card¶

The credit card table holds all information on which card is held by which owner/account through the disposition interim table. It also holds the card type. The card types are already labeled in English, so no translation mapping is needed.

In [ ]:
card = data_loader.load_csv('card', parse_dates={'issued': '%y%m%d'})

card.sample(5)
Out[ ]:
card_id disp_id type issued
159 854 6753 classic 1996-10-07
205 137 786 junior 1997-01-10
445 481 3102 junior 1998-01-06
350 19 130 classic 1997-09-09
533 510 3256 classic 1998-04-13

District¶

The district table contains information on the domicile of the accounts and the domicile of the clients.

This table has numerous variables with opaque names (A1 through A16). These get renamed for clarity:

In [ ]:
district = data_loader.load_csv('district')

district = district.rename(columns={
    'A1': 'district_id',
    'A2': 'district_name',
    'A3': 'region',
    'A4': 'population',
    'A5': 'n_municipalities_with_inhabitants_lt_499',
    'A6': 'n_municipalities_with_inhabitants_500_to_1999',
    'A7': 'n_municipalities_with_inhabitants_2000_to_9999',
    'A8': 'n_municipalities_with_inhabitants_gt_10000',
    'A9': 'n_cities',
    'A10': 'ratio_urban_inhabitants',
    'A11': 'average_salary',
    'A12': 'unemployment_rate_95',
    'A13': 'unemployment_rate_96',
    'A14': 'enterpreneurs_per_1000_inhabitants',
    'A15': 'n_commited_crimes_95',
    'A16': 'n_commited_crimes_96'
})

district.sample(5)
Out[ ]:
district_id district_name region population n_municipalities_with_inhabitants_lt_499 n_municipalities_with_inhabitants_500_to_1999 n_municipalities_with_inhabitants_2000_to_9999 n_municipalities_with_inhabitants_gt_10000 n_cities ratio_urban_inhabitants average_salary unemployment_rate_95 unemployment_rate_96 enterpreneurs_per_1000_inhabitants n_commited_crimes_95 n_commited_crimes_96
70 71 Novy Jicin north Moravia 161227 5 35 11 4 10 69.7 8678 5.93 5.57 102 4980 4595
72 73 Opava north Moravia 182027 17 49 12 2 7 56.4 8746 3.33 3.74 90 4355 4433
75 76 Sumperk north Moravia 127369 31 32 13 2 7 51.2 8369 4.73 5.88 107 3736 2807
25 26 Plzen - mesto west Bohemia 170449 0 0 0 1 1 100.0 10787 2.64 3.09 131 6041 6261
55 56 Breclav south Moravia 124605 11 42 14 1 8 46.0 8772 4.69 4.98 126 2854 3158

Data Merging¶

Since every table is now loaded into memory, the next goal is to merge the data into one client dataframe (client_df). The centerpiece of this merging process is the account table, since all relations of the dataset center around it.

In [ ]:
from src.data_utils import add_prefix_except_id

account = add_prefix_except_id(account, 'account_', id_exceptions=['district_id'])
client_df = disp.merge(account, on='account_id', how='left')

client = add_prefix_except_id(client, 'client_', id_exceptions=['district_id'])
client_df = client_df.merge(client, on='client_id', how='left')

order = add_prefix_except_id(order_pivot, 'ordertype_')
client_df = client_df.merge(order, on='account_id', how='left')

loan = add_prefix_except_id(loan, 'loan_')
client_df = client_df.merge(loan, on='account_id', how='left')

card = add_prefix_except_id(card, 'card_')
client_df = client_df.merge(card, on='disp_id', how='left')

client_district = add_prefix_except_id(district, 'client_district_')
client_df = client_df.merge(client_district, left_on='client_district_id', right_on='district_id', how='left')

account_district = add_prefix_except_id(district, 'account_district_')
client_df = client_df.merge(account_district, left_on='account_district_id', right_on='district_id', how='left')

client_df.sample(5)

n_merged_base_client = client_df.shape[0]

To see if we now have one row for each account, the following assertion is made:

In [ ]:
assert client_df['account_id'].nunique() == client_df.shape[0]

The assertion passes, therefore we can assume that the merging process was successful and no account was accidentally merged more than once into the client_df dataframe.
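pandas can also enforce this invariant during the merge itself: the validate argument of DataFrame.merge raises a MergeError as soon as a key is duplicated, instead of failing in a later assertion. A minimal sketch on toy frames (not the notebook's tables):

```python
import pandas as pd

# Toy frames standing in for two tables keyed on account_id.
left = pd.DataFrame({'account_id': [1, 2, 3], 'frequency': ['M', 'W', 'M']})
right = pd.DataFrame({'account_id': [1, 2, 3], 'balance': [100.0, 250.0, 80.0]})

# validate='one_to_one' makes pandas raise pandas.errors.MergeError
# if either side contains a duplicated merge key.
merged = left.merge(right, on='account_id', how='left', validate='one_to_one')
print(merged.shape)  # (3, 3)
```

For the client_df merges, 'one_to_one' (or 'many_to_one' where a lookup table like district is joined in) would document the expected cardinality right at the merge call.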

Data Cleaning¶

The next step is to clean the dataset.

Removing Junior Cards¶

The task at hand only requires predicting whether a customer will get a classic or gold card. Junior cards are to be dismissed.

In [ ]:
junior_clients = client_df[client_df['card_type'] == 'junior']

client_df = client_df[~client_df['account_id'].isin(junior_clients['account_id'])]

transaction = transaction[~transaction['account_id'].isin(junior_clients['account_id'])]

client_df['has_card'] = client_df['card_id'].notnull()

print(f'Number of junior clients: {junior_clients.shape[0]}')
print(f'Number of clients remaining: {client_df.shape[0]}')
Number of junior clients: 145
Number of clients remaining: 4355

Additionally, clients without cards could potentially be in the age range of someone who would naturally get a Junior Card. Since there is no information on how these special cases should be handled, the clients in that age bracket will be removed in the following step.

To find out which clients should get removed from the dataset, first the age distribution needs to be examined to find a suitable "cutoff age".

In [ ]:
JUNIOR_AGE_CUTOFF = 21

fig, ax = plt.subplots(1, 2, figsize=(15, 5))

sns.histplot(junior_clients['client_age'], bins=40, ax=ax[0], kde=True)
ax[0].axvline(JUNIOR_AGE_CUTOFF, color='r', linestyle='--')
ax[0].text(JUNIOR_AGE_CUTOFF - .25, 10, 'Junior Age Cutoff', rotation=90, color='r')
ax[0].set_title('Junior Clients Age Distribution')
ax[0].set_xlabel('Age')

sns.histplot(client_df['client_age'], bins=20, ax=ax[1], color='b', kde=True)
ax[1].axvline(JUNIOR_AGE_CUTOFF, color='r', linestyle='--')
ax[1].text(JUNIOR_AGE_CUTOFF - 2, 100, 'Junior Age Cutoff', rotation=90, color='r')
ax[1].set_title('Non-Junior Clients Age Distribution')
ax[1].set_xlabel('Age')

plt.show()

For the cutoff, age 21 was chosen (non-card-holders at or below that threshold are removed), since there is a rapid decrease in Junior Card holders after that age. The remaining customers above the cutoff age that still own a Junior card may simply remain because the bank needs some time to reissue regular cards, which is why no perfectly clean cutoff age can be found in the distribution.

In [ ]:
underage_clients = client_df[(client_df['client_age'] <= JUNIOR_AGE_CUTOFF) & (~client_df['has_card'])]

client_df = client_df[~client_df['client_id'].isin(underage_clients['client_id'])]
transaction = transaction[~transaction['account_id'].isin(underage_clients['account_id'])]

print(f'Number of underage clients: {underage_clients.shape[0]}')
print(f'Number of clients remaining after underage client removal: {client_df.shape[0]}')
Number of underage clients: 222
Number of clients remaining after underage client removal: 4133

Model Construction¶

In the model construction step the main goal is to build a "counter quantity". The PKDD'99 dataset unfortunately only labels preexisting customers with an issue date. Customers who decided against a card may not be in the dataset, or are at least not marked as such.

Processing Transactional Data¶

The first task is to check whether every account in the transactions dataframe has a "first transaction". This would make the calculation of the monthly balance much easier, since everything can be summed up without having to worry about months missing from the transaction dataframe.

In [ ]:
min_dates = transaction.groupby('account_id')['date'].min().reset_index()
min_dates.rename(columns={'date': 'min_date'}, inplace=True)

transactions_with_min_date = pd.merge(transaction, min_dates, on='account_id')

first_day_transactions = transactions_with_min_date[transactions_with_min_date['date'] == transactions_with_min_date['min_date']]
first_day_transactions = first_day_transactions.copy()

first_day_transactions['amount_equals_balance'] = first_day_transactions['amount'] == first_day_transactions['balance']

accounts_meeting_condition = first_day_transactions.groupby('account_id')['amount_equals_balance'].any().reset_index()

all_accounts_covered = accounts_meeting_condition['amount_equals_balance'].all()

print("Does every account's first day of transactions include at least one transaction where amount equals balance?", all_accounts_covered)
Does every account's first day of transactions include at least one transaction where amount equals balance? True

Since the goal is to have a roll-up window per month, the transactions need to be aggregated on a per-month level.

The code first groups the transactions by account_id and month, and aggregates the data to calculate monthly volume, credit, withdrawal, and the number of transactions (n_transactions). After that the minimum and maximum month for each account_id gets calculated.

Finally, the cumulative sum of the volume column for each account_id is calculated, resulting in a running balance over time.

In [ ]:
transaction['month'] = transaction['date'].dt.to_period('M')

transactions_monthly = transaction.groupby(['account_id', 'month']).agg(
    volume=('amount', 'sum'),
    credit=('amount', lambda x: x[transaction.loc[x.index, 'type'] == 'CREDIT'].sum()),
    withdrawal=('amount', lambda x: x[transaction.loc[x.index, 'type'] == 'WITHDRAWAL'].sum()),
    n_transactions=('amount', 'size')
).reset_index()

transactions_monthly['month'] = pd.PeriodIndex(transactions_monthly['month'])

date_ranges = transactions_monthly.groupby('account_id')['month'].agg(['min', 'max'])

def reindex_df(group, account_id):
    idx = pd.period_range(start=group['month'].min(), end=group['month'].max(), freq='M')
    group.set_index('month', inplace=True)
    group = group.reindex(idx, fill_value=0)
    group.reset_index(inplace=True)
    group.rename(columns={'index': 'month'}, inplace=True)
    group['account_id'] = account_id
    return group

transactions_monthly = (transactions_monthly.groupby('account_id')
                        .apply(lambda x: reindex_df(x, x.name))
                        .reset_index(level=0, drop=True))

transactions_monthly['balance'] = transactions_monthly.groupby('account_id')['volume'].cumsum()

transactions_monthly.sample(5)
Out[ ]:
month account_id volume credit withdrawal n_transactions balance
5 1998-01 2644 17485.7 5064.1 12421.6 13 46045.3
22 1995-05 1297 42004.4 20720.8 21283.6 7 933208.2
15 1998-08 250 11099.8 6699.2 4400.6 6 202891.1
23 1998-08 1258 5210.7 2963.1 2247.6 6 127985.3
36 1996-05 3316 33962.0 15492.4 18469.6 5 1210751.5

As the sample output shows, we now have an aggregation for all months and accounts with a summary of the monthly credit, withdrawal, balance and n_transactions.

Defining Roll-Up Windows of Transactions¶

Before we can continue to construct the roll-up windows for all customers, we need to propose a hypothetical card_issued date for the clients that have no card (negative samples).

To avoid introducing a new bias towards card-holders or non-card-holders, the idea here is to sample issue dates from the card holders. An important point is to eliminate or suppress seasonality as well as possible (hence a 13-month roll-up window will be constructed). Such seasonality could also emerge if we just sampled random dates. To avoid that issue and obtain a counter set to our existing card holders, the distribution of days between account creation and the issue date is used. This way we ensure that we don't sample dates that lie outside a non-card customer's account activity.

In [ ]:
clients_with_cards = client_df[~client_df['card_issued'].isnull()]
time_between_creation_and_issue = (clients_with_cards['card_issued'] - clients_with_cards['account_date']).dt.days

plt.figure(figsize=(10, 6))
sns.histplot(time_between_creation_and_issue, bins=50, stat='percent', kde=False, color='blue', linewidth=1)

plt.title('Distribution of Days between Account Creation and Card Issue Date')
plt.xlabel('Days')
plt.ylabel('Percentage of Accounts')
plt.grid(True)

plt.show()

The distribution shows that cards usually get issued at the earliest around 200 days after account creation. Issuances decrease steadily from there, but in rare cases it can take a customer over 5 years (>2000 days) to decide to get a credit card.

Based on this distribution we will now sample the deltas (time between account creation and issuance) for our non-card holders.

In [ ]:
np.random.seed(1337)

sampled_deltas = np.random.choice(time_between_creation_and_issue, size=len(client_df[client_df['card_issued'].isnull()]))

plt.figure(figsize=(10, 6))
sns.histplot(sampled_deltas, bins=50, stat='percent', kde=False, color='blue', linewidth=1, label='Sampled Deltas')
sns.histplot(time_between_creation_and_issue, bins=50, stat='percent', kde=False, color='orange', linewidth=1, label='Original Deltas')

plt.title('Distribution of Days between Account Creation and Card Issue Date')
plt.xlabel('Days')
plt.ylabel('Percentage of Accounts')
plt.legend()
plt.grid(True)

plt.show()

Through the random sampling we can now use every negative sample in the dataset. Additionally, the sampling adds a small amount of noise, which is expected to have a generalizing effect on the data.

To check if we now have an issue date for all customers the following check is executed:

In [ ]:
if len(client_df[client_df['card_issued'].isnull()]) > 0:
    client_df.loc[client_df['card_issued'].isnull(), 'card_issued'] = client_df.loc[client_df['card_issued'].isnull(), 'account_date'] + pd.to_timedelta(sampled_deltas, unit='D')
    
print(f'Number of NaT/NaN values in card_issued: {client_df["card_issued"].isnull().sum()}')
Number of NaT/NaN values in card_issued: 0

The check shows that we now successfully sampled an issue date for all customers.

In [ ]:
card_issued = card.groupby('disp_id')['card_issued'].min().reset_index()
card_issued.head(5)
Out[ ]:
disp_id card_issued
0 9 1998-10-16
1 19 1998-03-13
2 41 1995-09-03
3 42 1998-11-26
4 51 1995-04-24

We can now merge the issue dates into the monthly transaction aggregation.

In [ ]:
transactions_monthly = transactions_monthly.merge(client_df[['account_id', 'card_issued']], left_on='account_id', right_on='account_id', how='left')

transactions_monthly.head()
Out[ ]:
month account_id volume credit withdrawal n_transactions balance card_issued
0 1995-03 1 1000.0 1000.0 0.0 1 1000.0 1998-09-18
1 1995-04 1 16298.2 16298.2 0.0 3 17298.2 1998-09-18
2 1995-05 1 5858.0 5858.0 0.0 3 23156.2 1998-09-18
3 1995-06 1 3979.6 3779.6 200.0 3 27135.8 1998-09-18
4 1995-07 1 9087.9 3787.9 5300.0 3 36223.7 1998-09-18

Now, let's see if the join worked correctly and we don't have any transactions without a card_issued date anymore.

In [ ]:
assert transactions_monthly['card_issued'].isnull().sum() == 0

Validating Negative Examples¶

The goal for our model is to learn from a base set of customers (with and without cards) that lived through the same economic circumstances. To validate that all customers in our current set are well distributed, the next plot looks at the distribution of volume throughout the dataset's timeframe.

In [ ]:
transactions_monthly['has_card'] = transactions_monthly['account_id'].isin(client_df[client_df['has_card']]['account_id'])

transactions_monthly.reset_index(inplace=True)
transactions_monthly['id'] = range(len(transactions_monthly))
transactions_monthly.set_index('id', inplace=True)

transactions_with_cards = transactions_monthly[transactions_monthly['has_card'] == True]
transactions_without_cards = transactions_monthly[transactions_monthly['has_card'] == False]

grouped_with_cards = transactions_with_cards.groupby('month')['volume'].sum().reset_index()
grouped_without_cards = transactions_without_cards.groupby('month')['volume'].sum().reset_index()

grouped_with_cards['month'] = grouped_with_cards['month'].astype(str)
grouped_without_cards['month'] = grouped_without_cards['month'].astype(str)

grouped_with_cards['avg_volume_per_client'] = grouped_with_cards['volume'] / transactions_with_cards.shape[0]
grouped_without_cards['avg_volume_per_client'] = grouped_without_cards['volume'] / transactions_without_cards.shape[0]

plt.figure(figsize=(12, 6))

sns.lineplot(x='month', y='avg_volume_per_client', data=grouped_with_cards, marker='o', label='Clients with Cards')
sns.lineplot(x='month', y='avg_volume_per_client', data=grouped_without_cards, marker='o', label='Clients without Cards')

plt.title('Transaction Volume per Month')
plt.xlabel('Month')
plt.ylabel('Average Transaction Volume per Client ($)')
plt.legend()
plt.grid(True)

plt.xticks(ticks=grouped_with_cards['month'][grouped_with_cards['month'].str.endswith('-01')], rotation=45)
plt.gca().set_xticklabels([label[:4] for label in grouped_with_cards['month'][grouped_with_cards['month'].str.endswith('-01')]])

plt.show()

We can see that clients without cards generally have lower transaction volume, though the trends are the same for both customer types: around Christmas and New Year's Eve the transaction volume rises rapidly. The same spike can be seen in the middle of each year, perhaps when customers go on summer vacation.

To construct the roll-up window for every customer, the difference of each month to the issue date's month needs to be calculated. After that, the size of the roll-up window can be determined. In this case a window of 13 months was chosen: a year plus a lag month, which aims to capture the client's decision time or the bank's time until the card has been sent out to the customer.

In [ ]:
transactions_monthly['card_issued'] = pd.to_datetime(transactions_monthly['card_issued'])
transactions_monthly['card_issued_period'] = transactions_monthly['card_issued'].dt.to_period('M')

transactions_monthly['month_diff'] = transactions_monthly.apply(lambda row: (row['card_issued_period'] - row['month']).n if pd.notnull(row['card_issued_period']) and pd.notnull(row['month']) else None, axis=1)

filtered_transactions = transactions_monthly[transactions_monthly['month_diff'].between(1, 13)]

transactions_monthly.sample(5)
Out[ ]:
index month account_id volume credit withdrawal n_transactions balance card_issued has_card card_issued_period month_diff
id
44634 44634 1998-05 1252 47796.5 24851.8 22944.7 5 1777803.9 1999-05-20 False 1999-05 12
44011 44011 1997-07 1235 10518.0 4411.4 6106.6 6 373341.3 1995-12-18 False 1995-12 -19
95818 95818 1997-12 2651 98919.1 54142.5 44776.6 6 4442381.9 1995-07-09 True 1995-07 -29
73459 73459 1996-08 2035 52500.5 29600.5 22900.0 3 155070.1 2000-06-14 False 2000-06 46
162431 162431 1995-08 8998 1900.0 1900.0 0.0 1 21400.0 1997-11-23 False 1997-11 27

The sample shows that, as expected, the dataframe now has a month_diff column describing how many months remain until the issue month.
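The row-wise apply above works but scales poorly on large frames. Assuming both columns hold monthly Period values (as they do after the to_period('M') conversion), the same difference can be computed column-wise; the toy values below are illustrative only:

```python
import pandas as pd

# toy frame mirroring the relevant columns (values are illustrative only)
df = pd.DataFrame({
    "month": pd.PeriodIndex(["1998-05", "1997-07"], freq="M"),
    "card_issued_period": pd.PeriodIndex(["1999-05", "1995-12"], freq="M"),
})

# months between the two periods, computed vectorized instead of per row;
# NaT periods propagate as NaN automatically, matching the apply's None branch
df["month_diff"] = (
    (df["card_issued_period"].dt.year - df["month"].dt.year) * 12
    + (df["card_issued_period"].dt.month - df["month"].dt.month)
)
```

On the two sample rows shown above this yields 12 and -19, matching the apply-based result.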

Pivoting the transactions¶

An issue at this point is that an account may have months without any transactions, so the data needs to be filled in more thoroughly:

  • Volume, Withdrawal, Credit and Number of Transactions: Missing months in these variables can simply be set to 0, since no recorded month means there was no activity.
  • Balance: The balance is carried forward from the nearest preceding recorded month, so consecutive "missing" months all take the balance of the last recorded month.
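The balance rule can be illustrated on toy data for a single account: after sorting by month_diff (1 = most recent month before issuance, higher values = further back in time), a back-fill pulls each missing balance from the nearest earlier recorded month, with 0 as the fallback when no earlier month exists. This is a minimal sketch of the rule, not the implementation used below:

```python
import numpy as np
import pandas as pd

# one account's month-end balances keyed by months-before-issue; NaN = no transactions
balances = pd.Series(
    [np.nan, np.nan, 120.0, np.nan, 80.0],
    index=pd.Index([1, 2, 3, 4, 5], name="month_diff"),
)

# higher month_diff = further back in time, so a back-fill over ascending
# month_diff takes each gap's value from the nearest earlier recorded month
filled = balances.sort_index().bfill().fillna(0)
```

Months 1 and 2 inherit the balance recorded at month_diff 3, and month 4 the one from month_diff 5.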
In [ ]:
account_summary = pd.DataFrame(filtered_transactions['account_id'].unique(), columns=['account_id'])

variables_to_pivot = ['volume', 'withdrawal', 'credit', 'n_transactions', 'balance']

for variable in variables_to_pivot:
    grouped = filtered_transactions.groupby(['account_id', 'month_diff'])[variable].sum().reset_index()
    pivot = grouped.pivot(index='account_id', columns='month_diff', values=variable).reset_index()
    pivot.columns = ['account_id'] + [f'{variable}_month_diff_{int(col)}' if col != 'account_id' else 'account_id' for col in pivot.columns[1:]]
    account_summary = pd.merge(account_summary, pivot, on='account_id', how='left')

for variable in ['volume', 'withdrawal', 'credit', 'n_transactions']:
    account_summary.update(account_summary.filter(regex=f'^{variable}_').fillna(0))

def find_last_balance(account_id, starting_month_diff):
    higher_month_diff = transactions_monthly[
        (transactions_monthly['account_id'] == account_id) & 
        (transactions_monthly['month_diff'] >= starting_month_diff)
    ].sort_values('month_diff')
    
    last_balance_row = higher_month_diff[higher_month_diff['balance'].notna()].head(1)
    
    if not last_balance_row.empty:
        return last_balance_row.iloc[0]['balance']
    else:
        return 0

balance_columns = [col for col in account_summary.columns if 'balance_month_diff_' in col]
for idx, row in account_summary.iterrows():
    for col in balance_columns:
        if pd.isna(row[col]):
            month_diff = int(col.split('_')[-1])
            last_balance = find_last_balance(row['account_id'], month_diff + 1)
            account_summary.at[idx, col] = last_balance
            
account_summary.sample(5)
Out[ ]:
account_id volume_month_diff_1 volume_month_diff_2 volume_month_diff_3 volume_month_diff_4 volume_month_diff_5 volume_month_diff_6 volume_month_diff_7 volume_month_diff_8 volume_month_diff_9 volume_month_diff_10 volume_month_diff_11 volume_month_diff_12 volume_month_diff_13 withdrawal_month_diff_1 withdrawal_month_diff_2 withdrawal_month_diff_3 withdrawal_month_diff_4 withdrawal_month_diff_5 withdrawal_month_diff_6 withdrawal_month_diff_7 withdrawal_month_diff_8 withdrawal_month_diff_9 withdrawal_month_diff_10 withdrawal_month_diff_11 withdrawal_month_diff_12 withdrawal_month_diff_13 credit_month_diff_1 credit_month_diff_2 credit_month_diff_3 credit_month_diff_4 credit_month_diff_5 credit_month_diff_6 credit_month_diff_7 credit_month_diff_8 credit_month_diff_9 credit_month_diff_10 credit_month_diff_11 credit_month_diff_12 credit_month_diff_13 n_transactions_month_diff_1 n_transactions_month_diff_2 n_transactions_month_diff_3 n_transactions_month_diff_4 n_transactions_month_diff_5 n_transactions_month_diff_6 n_transactions_month_diff_7 n_transactions_month_diff_8 n_transactions_month_diff_9 n_transactions_month_diff_10 n_transactions_month_diff_11 n_transactions_month_diff_12 n_transactions_month_diff_13 balance_month_diff_1 balance_month_diff_2 balance_month_diff_3 balance_month_diff_4 balance_month_diff_5 balance_month_diff_6 balance_month_diff_7 balance_month_diff_8 balance_month_diff_9 balance_month_diff_10 balance_month_diff_11 balance_month_diff_12 balance_month_diff_13
381 498 39659.4 34483.7 29442.9 42462.9 65889.1 36413.9 37090.9 40801.7 34286.6 61109.7 57040.4 41033.8 41389.2 18474.6 13314.6 8314.6 21354.6 34334.6 15314.6 16014.6 19734.6 13214.6 40034.6 25494.6 19954.6 20314.6 21184.8 21169.1 21128.3 21108.3 31554.5 21099.3 21076.3 21067.1 21072.0 21075.1 31545.8 21079.2 21074.6 5.0 4.0 4.0 6.0 5.0 4.0 4.0 5.0 4.0 12.0 5.0 5.0 4.0 984707.8 945048.4 910564.7 881121.8 838658.9 772769.8 736355.9 699265.0 658463.3 624176.7 563067.0 506026.6 464992.8
1952 2492 13527.2 15698.0 14574.7 16752.0 15593.3 16138.8 12432.8 9681.2 9434.3 6932.7 7620.0 0.0 0.0 6555.6 8715.6 7595.6 9755.6 8595.6 9135.6 5420.6 2700.0 2500.0 0.0 0.0 0.0 0.0 6971.6 6982.4 6979.1 6996.4 6997.7 7003.2 7012.2 6981.2 6934.3 6932.7 7620.0 0.0 0.0 6.0 7.0 7.0 6.0 7.0 7.0 5.0 3.0 3.0 2.0 3.0 0.0 0.0 138385.0 124857.8 109159.8 94585.1 77833.1 62239.8 46101.0 33668.2 23987.0 14552.7 7620.0 0.0 0.0
310 408 32231.8 31975.1 28891.9 20395.0 30101.3 28733.4 25560.1 22529.5 19207.1 43833.1 38023.2 24737.1 25429.5 18736.6 18456.6 15356.6 6856.6 9936.6 15236.6 12056.6 9056.6 5056.6 30376.6 17876.6 11256.6 11956.6 13495.2 13518.5 13535.3 13538.4 20164.7 13496.8 13503.5 13472.9 14150.5 13456.5 20146.6 13480.5 13472.9 7.0 6.0 5.0 5.0 6.0 6.0 5.0 5.0 5.0 12.0 7.0 5.0 5.0 1101357.2 1069125.4 1037150.3 1008258.4 987863.4 957762.1 929028.7 903468.6 880939.1 861732.0 817898.9 779875.7 755138.6
1805 2312 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 14220.2 12930.7 8808.9 8800.5 12903.9 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 8832.0 7526.6 3406.6 3406.6 7506.6 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 5388.2 5404.1 5402.3 5393.9 5397.3 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 5.0 6.0 4.0 4.0 5.0 153673.6 153673.6 153673.6 153673.6 153673.6 153673.6 153673.6 153673.6 153673.6 139453.4 126522.7 117713.8 108913.3
2326 2958 9079.5 20019.4 8629.8 13028.9 14947.1 5283.6 5265.0 5265.0 5265.0 5265.0 400.0 0.0 0.0 3755.6 14675.6 3275.6 7675.6 9575.6 14.6 0.0 0.0 0.0 0.0 0.0 0.0 0.0 5323.9 5343.8 5354.2 5353.3 5371.5 5269.0 5265.0 5265.0 5265.0 5265.0 400.0 0.0 0.0 5.0 5.0 4.0 6.0 6.0 3.0 1.0 1.0 1.0 1.0 1.0 0.0 0.0 92448.3 83368.8 63349.4 54719.6 41690.7 26743.6 21460.0 16195.0 10930.0 5665.0 400.0 0.0 0.0

As we can see, the columns are pivoted correctly and every account now has its 13-month roll-up window, which we can merge into the existing client dataframe (the main dataframe).

In [ ]:
client_df = client_df.merge(account_summary, on='account_id', how='inner')

Bringing the data together¶

Now that the main model construction is done, the following visualization recaps the process the data went through.

In [ ]:
n_lt_13_month_hist = transactions_monthly['account_id'].nunique()-filtered_transactions['account_id'].nunique()

preprocessing_summary = [
        n_merged_base_client,
        -len(junior_clients),
        -n_lt_13_month_hist,
        -len(underage_clients),
        len(client_df)
]

fig = go.Figure(go.Waterfall(
        name = "Clients", orientation = "v",
        measure = ["absolute", "relative", "relative", "relative", "total"],
        x = ["Base Client List", 
             "Junior Clients", 
             "Clients with less than 13 Months of Transaction History", 
             "Non-Card-Holders that are underage", 
             "Remaining Clients"],
        textposition = "outside",
        y = preprocessing_summary,
        text = [str(x) for x in preprocessing_summary],
        connector = {"line":{"color":"rgb(63, 63, 63)"}},
))

fig.update_layout(
        title = "Client Base Reduction During Preprocessing",
        showlegend = True
)

fig.show()

The baseline dataset consists of 4500 clients. Of these, 145 are junior clients, who need to be removed since the task at hand does not focus on this customer group. After the junior card removal, clients with less than 13 months of history were discarded, since they have not been customers long enough to build a roll-up window and would therefore not be suitable for prediction. Age is an implicit dependency of the junior cards: customers without cards who would fall into the junior category were also discarded, since the model is not meant to decide whether such "underage" customers should get a card. The remaining dataset, or "golden record", therefore consists of 3715 customers.
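Since only the junior-client count and the two endpoints are stated explicitly, the two remaining waterfall steps must jointly account for the rest; a quick arithmetic sanity check (the split between the two steps is not given here):

```python
base_clients = 4500       # baseline client list
junior_clients = 145      # removed: junior card holders
remaining_clients = 3715  # final "golden record"

# clients removed for short transaction history or as underage non-card-holders
other_removals = base_clients - junior_clients - remaining_clients
print(other_removals)  # 640
```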

Exploratory Data Analysis¶

In this section we will explore the structure of the data at hand.

Exploring Missing Values¶

The first step is to look at missing values, and in particular to find out whether there are completely empty columns.

In [ ]:
missing_values = client_df.isnull().sum()

missing_values = pd.DataFrame({
    'Number of Missing Values': missing_values[missing_values > 0],
    'Percentage of Missing Values': [
        f"{x:.2f}%" for x in (missing_values[missing_values > 0] / client_df.shape[0] * 100).round(2)
    ]
}, columns=['Number of Missing Values', 'Percentage of Missing Values'])

plt.figure(figsize=(10, 6))
sns.barplot(x=missing_values.index, y=missing_values['Number of Missing Values'])

for index, value in enumerate(missing_values['Number of Missing Values']):
    plt.text(index, value, missing_values['Percentage of Missing Values'][index], ha='center', va='bottom')

plt.title('Missing Values in the Dataset')
plt.xlabel('Columns')
plt.ylabel('Number of Missing Values')

plt.xticks(rotation=90)
plt.tight_layout()
plt.show()
[Figure: Missing Values in the Dataset]

The visualization does not show any abnormality. Some order types are entirely empty, since there are accounts without any orders. The loan-specific columns all share the same percentage of missing values, because most accounts have not taken out loans. Finally, numerous customers are without a card, which is also an expected observation.

Cardholders vs Non-Cardholders¶

In this next step the goal is to get an overview of how the customers are distributed with respect to card ownership.

In [ ]:
card_owners = client_df[client_df['has_card']]
non_card_owners = client_df[~client_df['has_card']]
In [ ]:
data = {
    'Card Ownership': ['Card Owners', 'Non-Card Owners'],
    'Number of Clients': [card_owners.shape[0], non_card_owners.shape[0]]
}

plt.figure(figsize=(10, 6))
sns.barplot(x='Card Ownership', y='Number of Clients', data=data, palette='viridis', hue='Card Ownership')

for index, value in enumerate(data['Number of Clients']):
    plt.text(index, value, str(value), ha='center', va='bottom')

plt.title('Card Owners vs Non-Card Owners')
plt.xlabel('Card Ownership')
plt.ylabel('Number of Clients')
plt.grid(True)

plt.show()
[Figure: Card Owners vs Non-Card Owners]

The "golden record" has 747 card holders and 2968 customers without cards, so the non-card owners clearly outnumber the card holders.

Card Types¶

The task doesn't require classifying gold vs. classic card holders separately, though it might be beneficial to also look at how the card owners are distributed across their "card class".

In [ ]:
card_types = card_owners['card_type'].value_counts()
card_types = card_types.reset_index()
card_types.columns = ['Card Type', 'Number of Clients']

plt.figure(figsize=(10, 6))
sns.barplot(x='Card Type', y='Number of Clients', data=card_types, palette='viridis', hue='Card Type')

for index, value in enumerate(card_types['Number of Clients']):
    plt.text(index, value, str(value), ha='center', va='bottom')

plt.title('Card Types of Card Owners')
plt.xlabel('Card Type')
plt.ylabel('Number of Clients')
plt.grid(True)

plt.show()
[Figure: Card Types of Card Owners]

The card type distribution shows that the majority of cardholders own classic cards; gold cards are only rarely issued.

Distribution of Issuance Dates¶

The following plot shows the distribution of card issuance dates for both card owners and customers without cards.

In [ ]:
plt.figure(figsize=(10, 6))
sns.histplot(card_owners['card_issued'], bins=50, kde=True, color='blue', label='Card Owners')
sns.histplot(non_card_owners['card_issued'], bins=50, kde=True, color='orange', label='Non-Card Owners')

plt.title('Distribution of Card Issue Date')
plt.xlabel('Issue Date')
plt.ylabel('Number of Clients')
plt.legend()
plt.grid(True)

plt.show()
[Figure: Distribution of Card Issue Date]

Both distributions peak around 1997-1998, indicating a period of high credit card issuance activity. This might be due to a marketing campaign, economic conditions, or other factors.

The similarity in trends between card owners and non-card owners suggests that the sampling method used to fill the missing issue dates for non-card owners was appropriate and preserved the overall temporal distribution of the original data.
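A sampling scheme with that property draws pseudo issue dates for non-card holders from the empirical distribution of the real issue dates. The exact method used earlier in the notebook is not shown in this section, so the sketch below (dates and names hypothetical) is only illustrative:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# observed issue dates of actual card holders (hypothetical values)
real_issue_dates = pd.to_datetime(["1997-03-14", "1997-11-02", "1998-06-21", "1998-01-05"])

# drawing with replacement preserves the empirical temporal distribution
pseudo_issue_dates = pd.to_datetime(rng.choice(real_issue_dates.values, size=10, replace=True))
```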

Balance and Volume over time¶

To get a glimpse of how temporal client traits might be used in modelling later on, we take a look at two sample accounts (account numbers 14 and 18):

In [ ]:
client_df[client_df['account_id'].isin([14, 18])]
Out[ ]:
disp_id client_id account_id account_district_id account_frequency account_date client_district_id client_gender client_birth_date client_age ordertype_HOUSEHOLD ordertype_INSURANCE PAYMENT ordertype_LEASING ordertype_LOAN PAYMENT ordertype_MISSING loan_id loan_date loan_amount loan_duration loan_payments loan_status card_id card_type card_issued district_id_x client_district_district_name client_district_region client_district_population client_district_n_municipalities_with_inhabitants_lt_499 client_district_n_municipalities_with_inhabitants_500_to_1999 client_district_n_municipalities_with_inhabitants_2000_to_9999 client_district_n_municipalities_with_inhabitants_gt_10000 client_district_n_cities client_district_ratio_urban_inhabitants client_district_average_salary client_district_unemployment_rate_95 client_district_unemployment_rate_96 client_district_enterpreneurs_per_1000_inhabitants client_district_n_commited_crimes_95 client_district_n_commited_crimes_96 district_id_y account_district_district_name account_district_region account_district_population account_district_n_municipalities_with_inhabitants_lt_499 account_district_n_municipalities_with_inhabitants_500_to_1999 account_district_n_municipalities_with_inhabitants_2000_to_9999 account_district_n_municipalities_with_inhabitants_gt_10000 account_district_n_cities account_district_ratio_urban_inhabitants account_district_average_salary account_district_unemployment_rate_95 account_district_unemployment_rate_96 account_district_enterpreneurs_per_1000_inhabitants account_district_n_commited_crimes_95 account_district_n_commited_crimes_96 has_card volume_month_diff_1 volume_month_diff_2 volume_month_diff_3 volume_month_diff_4 volume_month_diff_5 volume_month_diff_6 volume_month_diff_7 volume_month_diff_8 volume_month_diff_9 volume_month_diff_10 volume_month_diff_11 volume_month_diff_12 volume_month_diff_13 withdrawal_month_diff_1 withdrawal_month_diff_2 withdrawal_month_diff_3 withdrawal_month_diff_4 
withdrawal_month_diff_5 withdrawal_month_diff_6 withdrawal_month_diff_7 withdrawal_month_diff_8 withdrawal_month_diff_9 withdrawal_month_diff_10 withdrawal_month_diff_11 withdrawal_month_diff_12 withdrawal_month_diff_13 credit_month_diff_1 credit_month_diff_2 credit_month_diff_3 credit_month_diff_4 credit_month_diff_5 credit_month_diff_6 credit_month_diff_7 credit_month_diff_8 credit_month_diff_9 credit_month_diff_10 credit_month_diff_11 credit_month_diff_12 credit_month_diff_13 n_transactions_month_diff_1 n_transactions_month_diff_2 n_transactions_month_diff_3 n_transactions_month_diff_4 n_transactions_month_diff_5 n_transactions_month_diff_6 n_transactions_month_diff_7 n_transactions_month_diff_8 n_transactions_month_diff_9 n_transactions_month_diff_10 n_transactions_month_diff_11 n_transactions_month_diff_12 n_transactions_month_diff_13 balance_month_diff_1 balance_month_diff_2 balance_month_diff_3 balance_month_diff_4 balance_month_diff_5 balance_month_diff_6 balance_month_diff_7 balance_month_diff_8 balance_month_diff_9 balance_month_diff_10 balance_month_diff_11 balance_month_diff_12 balance_month_diff_13
8 19 19 14 47 MONTHLY CHARGES 1996-11-27 47 MALE 1942-12-28 57 1.0 0.0 0.0 0.0 0.0 NaN NaT NaN NaN NaN NaN 2.0 classic 1998-03-13 47 Pardubice east Bohemia 162580 83 26 5 1 6 72.8 9538 1.51 1.81 111 6079 5410 47 Pardubice east Bohemia 162580 83 26 5 1 6 72.8 9538 1.51 1.81 111 6079 5410 True 28168.3 33876.7 40562.4 31572.9 32283.0 27178.4 30076.9 28775.2 42763.1 30470.0 32125.1 28808.6 24288.4 13243.6 18943.6 18243.6 16643.6 17343.6 12243.6 15143.6 13843.6 20443.6 15543.6 17214.6 13900.0 9400.0 14924.7 14933.1 22318.8 14929.3 14939.4 14934.8 14933.3 14931.6 22319.5 14926.4 14910.5 14908.6 14888.4 5.0 5.0 6.0 5.0 5.0 5.0 5.0 5.0 5.0 5.0 4.0 3.0 3.0 459568.3 431400.0 397523.3 356960.9 325388.0 293105.0 265926.6 235849.7 207074.5 164311.4 133841.4 101716.3 72907.7
11 24 24 18 43 WEEKLY CHARGES 1993-05-26 43 FEMALE 1939-07-29 60 NaN NaN NaN NaN NaN NaN NaT NaN NaN NaN NaN NaN NaN 1994-04-05 43 Hradec Kralove east Bohemia 161854 73 21 6 1 6 76.3 9425 1.72 2.50 125 4085 3891 43 Hradec Kralove east Bohemia 161854 73 21 6 1 6 76.3 9425 1.72 2.50 125 4085 3891 False 94411.9 1120.4 19444.7 1900.0 0.0 200.0 2500.0 3500.0 2600.0 13500.0 1100.0 0.0 0.0 6600.0 1000.0 6000.0 1900.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 87811.9 120.4 13444.7 0.0 0.0 200.0 2500.0 3500.0 2600.0 13500.0 1100.0 0.0 0.0 5.0 2.0 3.0 1.0 0.0 1.0 1.0 1.0 1.0 1.0 1.0 0.0 0.0 140277.0 45865.1 44744.7 25300.0 23400.0 23400.0 23200.0 20700.0 17200.0 14600.0 1100.0 0.0 0.0

Account 14 has a card while account 18 does not; otherwise both accounts are fairly similar: no loans were taken out, both clients originate from the same region, and both are roughly in the same age bracket.

In [ ]:
account_14 = transactions_monthly[transactions_monthly['account_id'] == 14].copy()
account_14['month'] = account_14['month'].dt.to_timestamp()

account_18 = transactions_monthly[transactions_monthly['account_id'] == 18].copy()
account_18['month'] = account_18['month'].dt.to_timestamp()

plt.figure(figsize=(10, 6))
sns.lineplot(data=account_14, x='month', y='balance', marker='o', label='Account 14')
sns.lineplot(data=account_18, x='month', y='balance', marker='o', label='Account 18')

plt.axvspan(account_14['card_issued'].min() - pd.DateOffset(months=13), account_14['card_issued'].min(), color='r', alpha=0.3)
plt.axvspan(account_18['card_issued'].min() - pd.DateOffset(months=13), account_18['card_issued'].min(), color='b', alpha=0.3)

plt.text(account_14['card_issued'].min() - pd.DateOffset(months=13), 500000, '13 Month Rollup Window of Account 14', rotation=90, color='r')
plt.text(account_18['card_issued'].min() - pd.DateOffset(months=13), 500000, '13 Month Rollup Window of Account 18', rotation=90, color='b')

plt.title('Balance Over Time for Account 14 and Account 18')
plt.xlabel('Month')
plt.ylabel('Balance')
plt.legend()
plt.grid(True)

plt.show()
[Figure: Balance Over Time for Account 14 and Account 18]

The balance over time shows quite distinctive patterns:

  • Account 14:
    • The balance for Account 14 starts at zero and shows a steady, nearly linear increase from around 1997 to 1999.
  • Account 18:
    • The balance for Account 18 stays rather stable until the last third before issuance, where a more rapid increase can be observed.

Nonetheless, both accounts follow the overall trend of increasing activity over the observed timespan: both balances grew, so both customers seem to be saving up over time.

Next, the volume might reveal further behavioral patterns, so it is explored the same way:

In [ ]:
plt.figure(figsize=(10, 6))
sns.lineplot(x='month', y='volume', data=account_14, marker='o', label='Account 14', color='r')
sns.lineplot(x='month', y='volume', data=account_18, marker='o', label='Account 18', color='b')

plt.axvspan(account_14['card_issued'].min() - pd.DateOffset(months=13), account_14['card_issued'].min(), color='r', alpha=0.3)
plt.axvspan(account_18['card_issued'].min() - pd.DateOffset(months=13), account_18['card_issued'].min(), color='b', alpha=0.3)

plt.text(account_14['card_issued'].min() - pd.DateOffset(months=13), 5000, '13 Month Rollup Window of Account 14', rotation=90, color='r')
plt.text(account_18['card_issued'].min() - pd.DateOffset(months=13), 5000, '13 Month Rollup Window of Account 18', rotation=90, color='b')

plt.title('Volume Over Time for Account 14 and Account 18')
plt.xlabel('Month')
plt.ylabel('Volume')
plt.legend()
plt.grid(True)
[Figure: Volume Over Time for Account 14 and Account 18]
  • Account 18 experienced high volume activity between the years of 1994 and 1995 and another high peak between 1997 and 1998.
  • Account 14 in comparison shows a rather stable volume activity and does not peak as high as the account without a card.

When looking at both Rollup windows, we can clearly see a difference: Account 18 had a rapid increase in volume activity right before the sampled issuance date while Account 14 stayed rather stable throughout the rollup window.

Volume before card purchase¶

To look more closely at the months before an issue date, the following visualization shows the distribution of data for each month in the rollup window, for card holders and customers without cards:

In [ ]:
volume_month_diff_columns = [col for col in account_summary.columns if 'volume_month_diff_' in col]

combined_data_all = []

for col in volume_month_diff_columns:
    month_num = col.split('_')[-1]
    for value in card_owners[col]:
        combined_data_all.append({'Month': month_num, 'Volume': value, 'Group': 'Card Holders'})

for col in volume_month_diff_columns:
    month_num = col.split('_')[-1]
    for value in non_card_owners[col]:
        combined_data_all.append({'Month': month_num, 'Volume': value, 'Group': 'Non-Card Holders'})

combined_df_all = pd.DataFrame(combined_data_all)

plt.figure(figsize=(12, 7))
sns.boxplot(x='Month', y='Volume', hue='Group', data=combined_df_all, palette='viridis', flierprops=dict(alpha=0.1))
plt.title('Volume of Transactions for Card Holders and Non-Card Holders')
plt.xlabel('Months before purchase')
plt.ylabel('Volume of Transactions')
plt.legend(title='Group')

plt.show()
[Figure: Volume of Transactions for Card Holders and Non-Card Holders]

Overall the cardholders show a stronger decrease in volume over the 13-month rollup window, and their transaction volumes are skewed higher than those of customers without cards.

To conclude, the rollup window definitely shows a distinguishable pattern between card holders and clients without cards.

Age distribution¶

In an effort to eliminate potential Junior Card customers we have already looked at the age distribution. Nonetheless, this exploration now aims at potential differences between cardholders and clients without cards, in order to make out whether age is a viable feature for prediction later on.

In [ ]:
fig, ax1 = plt.subplots(figsize=(12, 6))

sns.histplot(card_owners['client_age'], bins=20, stat='percent', kde=False, color='blue', edgecolor='white', linewidth=1, label='Card Owners - Histogram', alpha=0.5, ax=ax1, zorder=2)
sns.histplot(non_card_owners['client_age'], bins=20, stat='percent', kde=False, color='orange', edgecolor='white', linewidth=1, label='Non-Card Owners - Histogram', alpha=0.5, ax=ax1, zorder=2)

ax2 = ax1.twinx()

sns.kdeplot(card_owners['client_age'], bw_adjust=0.5, color='blue', label='Card Owners - Density', ax=ax2, zorder=2)
sns.kdeplot(non_card_owners['client_age'], bw_adjust=0.5, color='orange', label='Non-Card Owners - Density', ax=ax2, zorder=2)

ax1.set_title('Distribution of Client Age')
ax1.set_xlabel('Age')
ax1.set_ylabel('Percentage of Clients')
ax2.set_ylabel('Density')

handles1, labels1 = ax1.get_legend_handles_labels()
handles2, labels2 = ax2.get_legend_handles_labels()
ax1.legend(handles1 + handles2, labels1 + labels2, loc='upper right')

plt.grid(True, zorder=-1)
plt.show()
[Figure: Distribution of Client Age]

The distribution shows a vast difference between card holders and non-card clients between the ages of roughly 65 and 90: only a small fraction of customers in that range own cards.

Feature Engineering¶

In the Feature Engineering section the objective is to derive new features from the existing ones, in hopes of finding predictive signals that contribute to better model performance later on.

Demographic-Based Features¶

Defining Age Groups¶

As already explored, customer age might be a reasonable predictor of card ownership. Not only the continuous representation of age may be of value for modelling; an age-bracket representation might capture age as a category in a more reasonable way:

In [ ]:
age_brackets = [0, 20, 30, 40, 50, 60, 70, 100]

plt.figure(figsize=(12, 6))
sns.histplot(card_owners['client_age'], bins=age_brackets, stat='percent', kde=False, color='blue', edgecolor='white', linewidth=1, label='Card Owners', alpha=0.5)
sns.histplot(non_card_owners['client_age'], bins=age_brackets, stat='percent', kde=False, color='orange', edgecolor='white', linewidth=1, label='Non-Card Owners', alpha=0.5)

for age_bracket in age_brackets:
    plt.axvline(age_bracket, color='black', linestyle='--', linewidth=1)
    
plt.title('Age Distribution of Card Owners and Non-Card Owners')
plt.xlabel('Age')
plt.ylabel('Percentage of Clients')

for i in range(len(age_brackets) - 1):
    plt.text(age_brackets[i] + 3, 5, f'{age_brackets[i]}-{age_brackets[i+1]-1}', fontsize=8, color='black')
    
plt.legend()
plt.grid(True)

plt.show()
[Figure: Age Distribution of Card Owners and Non-Card Owners]

To put customers into age groups, 10-year bins were defined up to age 70, beyond which a single group represents the ages with the fewest cards per customer.

In [ ]:
client_df['age_bracket'] = pd.cut(client_df['client_age'], bins=age_brackets, right=False)

age_bracket_counts = client_df['age_bracket'].value_counts().reset_index()
age_bracket_counts.columns = ['Age Bracket', 'Number of Clients']

age_bracket_counts
Out[ ]:
Age Bracket Number of Clients
0 [50, 60) 767
1 [40, 50) 732
2 [30, 40) 729
3 [20, 30) 611
4 [60, 70) 460
5 [70, 100) 416
6 [0, 20) 0

Age at loan¶

Another demographic, age-related metric that may be of use is the age a customer had when they took out a loan.

In [ ]:
client_df['client_age_at_loan'] = (pd.to_datetime(client_df['loan_date']) - pd.to_datetime(client_df['client_birth_date'])).dt.days // 365

client_df.head(5)
Out[ ]:
disp_id client_id account_id account_district_id account_frequency account_date client_district_id client_gender client_birth_date client_age ordertype_HOUSEHOLD ordertype_INSURANCE PAYMENT ordertype_LEASING ordertype_LOAN PAYMENT ordertype_MISSING loan_id loan_date loan_amount loan_duration loan_payments loan_status card_id card_type card_issued district_id_x client_district_district_name client_district_region client_district_population client_district_n_municipalities_with_inhabitants_lt_499 client_district_n_municipalities_with_inhabitants_500_to_1999 client_district_n_municipalities_with_inhabitants_2000_to_9999 client_district_n_municipalities_with_inhabitants_gt_10000 client_district_n_cities client_district_ratio_urban_inhabitants client_district_average_salary client_district_unemployment_rate_95 client_district_unemployment_rate_96 client_district_enterpreneurs_per_1000_inhabitants client_district_n_commited_crimes_95 client_district_n_commited_crimes_96 district_id_y account_district_district_name account_district_region account_district_population account_district_n_municipalities_with_inhabitants_lt_499 account_district_n_municipalities_with_inhabitants_500_to_1999 account_district_n_municipalities_with_inhabitants_2000_to_9999 account_district_n_municipalities_with_inhabitants_gt_10000 account_district_n_cities account_district_ratio_urban_inhabitants account_district_average_salary account_district_unemployment_rate_95 account_district_unemployment_rate_96 account_district_enterpreneurs_per_1000_inhabitants account_district_n_commited_crimes_95 account_district_n_commited_crimes_96 has_card volume_month_diff_1 volume_month_diff_2 volume_month_diff_3 volume_month_diff_4 volume_month_diff_5 volume_month_diff_6 volume_month_diff_7 volume_month_diff_8 volume_month_diff_9 volume_month_diff_10 volume_month_diff_11 volume_month_diff_12 volume_month_diff_13 withdrawal_month_diff_1 withdrawal_month_diff_2 withdrawal_month_diff_3 withdrawal_month_diff_4 
withdrawal_month_diff_5 withdrawal_month_diff_6 withdrawal_month_diff_7 withdrawal_month_diff_8 withdrawal_month_diff_9 withdrawal_month_diff_10 withdrawal_month_diff_11 withdrawal_month_diff_12 withdrawal_month_diff_13 credit_month_diff_1 credit_month_diff_2 credit_month_diff_3 credit_month_diff_4 credit_month_diff_5 credit_month_diff_6 credit_month_diff_7 credit_month_diff_8 credit_month_diff_9 credit_month_diff_10 credit_month_diff_11 credit_month_diff_12 credit_month_diff_13 n_transactions_month_diff_1 n_transactions_month_diff_2 n_transactions_month_diff_3 n_transactions_month_diff_4 n_transactions_month_diff_5 n_transactions_month_diff_6 n_transactions_month_diff_7 n_transactions_month_diff_8 n_transactions_month_diff_9 n_transactions_month_diff_10 n_transactions_month_diff_11 n_transactions_month_diff_12 n_transactions_month_diff_13 balance_month_diff_1 balance_month_diff_2 balance_month_diff_3 balance_month_diff_4 balance_month_diff_5 balance_month_diff_6 balance_month_diff_7 balance_month_diff_8 balance_month_diff_9 balance_month_diff_10 balance_month_diff_11 balance_month_diff_12 balance_month_diff_13 age_bracket client_age_at_loan
(first five rows of client_df; the wide output table is abbreviated here. The trailing columns show the new age_bracket feature, e.g. [20, 30) and [50, 60), and client_age_at_loan, which is populated only for clients who hold a loan.)

Account age¶

Another age-related feature that can be derived is the account age, i.e. how long the account has existed. It is calculated in whole years relative to a fixed key date of 2000-01-01.

In [ ]:
client_df['account_age'] = (pd.to_datetime('2000-01-01') - pd.to_datetime(client_df['account_date'])).dt.days // 365

client_df.head(5)
Out[ ]:
(first five rows of client_df; the wide output table is abbreviated here. The trailing account_age column holds the new feature, with values 4, 6, 3, 2 and 5 years for the first five rows.)

Financial Features¶

Average Monthly Payment Ratio¶

A good marker of credit risk might be the ratio of monthly loan payment to average salary in the account's district. By comparing the client's loan payments to the average salary in their district, the bank can evaluate the affordability of the loan for the client. A high ratio indicates that the loan payments take up a significant portion of the average salary, which could suggest potential difficulties in loan or credit repayment.

In [ ]:
client_df['monthly_payment_ratio'] = client_df['loan_payments'] / client_df['account_district_average_salary']
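Note that clients without a loan have missing loan_payments, so their ratio comes out as NaN rather than a misleading zero (as visible in the output below). A minimal, self-contained illustration with values taken from the sample rows:

```python
import numpy as np
import pandas as pd

# Toy example: one client with a loan, one without
toy = pd.DataFrame({
    'loan_payments': [3373.0, np.nan],
    'account_district_average_salary': [12541, 8968],
})

# Same computation as above; NaN propagates for clients without a loan
toy['monthly_payment_ratio'] = toy['loan_payments'] / toy['account_district_average_salary']
```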

District Features¶

Client-Account Region Match¶

Since a client and their account can be registered in two distinct regions, the region_match feature captures whether the two regions match, i.e. whether the client has direct access to the bank within their own region.

In [ ]:
client_df['region_match'] = client_df['client_district_region'] == client_df['account_district_region']

District Population Density¶

Since the dataset contains no district area, population density is approximated here as population per city in the district.

Client District Population Density:

  • Population density can correlate with economic activity and employment opportunities. High-density areas might offer more diverse job opportunities, potentially reducing credit risk.

Account District Population Density:

  • The economic vibrancy of the district where the account is held can provide insight into the stability and reliability of account transactions and financial behavior.
In [ ]:
client_df['client_district_population_density'] = client_df['client_district_population'] / client_df['client_district_n_cities']
client_df['account_district_population_density'] = client_df['account_district_population'] / client_df['account_district_n_cities']

Event-Based Statistics¶

To capture behavioral metrics of each client's spending, the following block computes summary statistics (mean, standard deviation, minimum, maximum, sum and median) of the volume, credit, withdrawal, number of transactions and balance over all months before the card issue date.

In [ ]:
transactions_monthly['card_issued'] = pd.to_datetime(transactions_monthly['card_issued'])

transactions_monthly['month'] = transactions_monthly['month'].apply(lambda x: x.to_timestamp() if isinstance(x, pd.Period) else x)

transactions_filtered = transactions_monthly[transactions_monthly['month'] < transactions_monthly['card_issued']]

stat_columns = ['volume', 'credit', 'withdrawal', 'n_transactions', 'balance']

stats = transactions_filtered.groupby('account_id')[stat_columns].agg(['mean', 'std', 'min', 'max', 'sum', 'median']).reset_index()

stats.columns = ['account_id'] + [f"pre_card_{col[0]}_{col[1]}" for col in stats.columns[1:]]

client_df = client_df.merge(stats, on='account_id', how='left')

client_df.head(5)
Out[ ]:
(first five rows of client_df; the wide output table is abbreviated here. The new trailing columns are monthly_payment_ratio (NaN for clients without a loan), region_match, client_district_population_density, account_district_population_density, and the pre_card_{volume, credit, withdrawal, n_transactions, balance}_{mean, std, min, max, sum, median} statistics.)

The same statistics can also be calculated over the roll-up window only, without going back over each customer's entire history:

In [ ]:
month_diff_columns = [col for col in client_df.columns if 'month_diff' in col]

prefixes = set(col.rsplit('_', 2)[0] for col in month_diff_columns)

for prefix in prefixes:
    cols = [col for col in month_diff_columns if col.startswith(prefix)]
    client_df[f'{prefix}_mean'] = client_df[cols].mean(axis=1)
    client_df[f'{prefix}_std'] = client_df[cols].std(axis=1)
    client_df[f'{prefix}_min'] = client_df[cols].min(axis=1)
    client_df[f'{prefix}_max'] = client_df[cols].max(axis=1)
    client_df[f'{prefix}_sum'] = client_df[cols].sum(axis=1)
    client_df[f'{prefix}_median'] = client_df[cols].median(axis=1)
    
client_df.head(5)
Out[ ]:
(first five rows of client_df; the wide output table is abbreviated here. The new trailing columns are the roll-up window statistics {volume, credit, withdrawal, n_transactions, balance}_month_{mean, std, min, max, sum, median}.)

Since the earlier exploration showed that balance and volume develop differently towards the end of the window for card owners and non-card customers, quarterly aggregation statistics are also calculated within the roll-up window.

In [ ]:
month_diff_columns = [col for col in client_df.columns if 'month_diff' in col]

prefixes = set(col.rsplit('_', 3)[0] for col in month_diff_columns)

for prefix in prefixes:
    first_quarter_cols = [f'{prefix}_month_diff_{i}' for i in range(1, 4)]
    last_quarter_cols = [f'{prefix}_month_diff_{i}' for i in range(10, 13)]
    
    client_df[f'{prefix}_first_quarter_mean'] = client_df[first_quarter_cols].mean(axis=1)
    client_df[f'{prefix}_first_quarter_std'] = client_df[first_quarter_cols].std(axis=1)
    client_df[f'{prefix}_first_quarter_min'] = client_df[first_quarter_cols].min(axis=1)
    client_df[f'{prefix}_first_quarter_max'] = client_df[first_quarter_cols].max(axis=1)
    client_df[f'{prefix}_first_quarter_sum'] = client_df[first_quarter_cols].sum(axis=1)
    client_df[f'{prefix}_first_quarter_median'] = client_df[first_quarter_cols].median(axis=1)
    
    client_df[f'{prefix}_last_quarter_mean'] = client_df[last_quarter_cols].mean(axis=1)
    client_df[f'{prefix}_last_quarter_std'] = client_df[last_quarter_cols].std(axis=1)
    client_df[f'{prefix}_last_quarter_min'] = client_df[last_quarter_cols].min(axis=1)
    client_df[f'{prefix}_last_quarter_max'] = client_df[last_quarter_cols].max(axis=1)
    client_df[f'{prefix}_last_quarter_sum'] = client_df[last_quarter_cols].sum(axis=1)
    client_df[f'{prefix}_last_quarter_median'] = client_df[last_quarter_cols].median(axis=1)

Looking at the effect of the new features¶

To make sure the newly derived features are actually meaningful, they should be examined before moving on to modeling.

Impact of variance¶

Features with very low variance carry little information for prediction, so it makes sense to inspect the variance of each variable in our record. Any variable with zero variance should be removed, since it contains no information the models could use.

In [ ]:
numerical_cols = client_df.select_dtypes(include=['number']).columns
numerical_variance = client_df[numerical_cols].var()

pd.DataFrame(numerical_variance.sort_values().head(50), columns=['Variance'])
Out[ ]:
Variance
monthly_payment_ratio 0.060612
ordertype_LEASING 0.081946
ordertype_INSURANCE PAYMENT 0.118096
ordertype_HOUSEHOLD 0.138574
ordertype_LOAN PAYMENT 0.152048
ordertype_MISSING 0.321142
pre_card_n_transactions_std 0.502247
n_transactions_month_std 0.780694
pre_card_n_transactions_min 0.787620
client_district_n_municipalities_with_inhabitants_gt_10000 1.158812
account_district_n_municipalities_with_inhabitants_gt_10000 1.171299
n_transactions_last_quarter_std 1.471800
n_transactions_first_quarter_std 1.482871
pre_card_n_transactions_mean 1.874582
account_age 2.189220
pre_card_n_transactions_median 2.369724
n_transactions_month_mean 4.094829
account_district_unemployment_rate_96 4.556460
client_district_unemployment_rate_96 4.596699
n_transactions_month_median 4.596916
n_transactions_first_quarter_min 4.774850
n_transactions_first_quarter_median 5.829349
n_transactions_month_min 6.133087
n_transactions_first_quarter_mean 6.188442
n_transactions_last_quarter_min 6.468688
n_transactions_last_quarter_median 7.695102
n_transactions_month_diff_5 7.749776
n_transactions_month_diff_4 7.804750
n_transactions_month_diff_7 7.808253
n_transactions_month_diff_6 7.843252
n_transactions_month_diff_1 7.892763
n_transactions_month_diff_3 8.077765
n_transactions_month_diff_8 8.182369
n_transactions_month_diff_2 8.199526
n_transactions_last_quarter_mean 8.268584
account_district_n_cities 8.520973
client_district_n_cities 8.655055
n_transactions_month_diff_9 8.989684
n_transactions_month_diff_10 9.947550
n_transactions_month_diff_12 9.953236
n_transactions_month_diff_13 9.988650
n_transactions_month_diff_11 10.113869
pre_card_n_transactions_max 10.260971
n_transactions_month_max 10.370382
n_transactions_first_quarter_max 11.245221
n_transactions_last_quarter_max 13.871260
client_district_n_municipalities_with_inhabitants_2000_to_9999 19.236550
account_district_n_municipalities_with_inhabitants_2000_to_9999 19.630565
n_transactions_first_quarter_sum 55.695982
n_transactions_last_quarter_sum 74.417258

As we can see, no column has zero variance, so all of them carry at least some information that may be useful for the modeling process.

Looking at temporal client traits¶

To validate whether the newly added roll-up window statistics differ meaningfully between the groups, the following visualizations explore how distinct card holders and non-card customers are.

In [ ]:
from src.plot_utils import plot_agg_variables

plot_agg_variables(client_df, 'volume', ['mean'])
No description has been provided for this image

There are clear and distinct trends in the mean volume for the two groups. Customers who eventually get a card show a decreasing trend in volume over the rollup time, whereas the customers who do not get a card have a relatively stable or only slightly decreasing trend compared to the other group.

The volume in card holders is overall also much higher than for non-card customers over all months in the rollup windows.

In [ ]:
plot_agg_variables(client_df, 'balance', ['mean'])
No description has been provided for this image

In comparison, the balance shows fewer structural differences: both customer groups exhibit a roughly linearly decreasing mean balance. The customers with cards again appear to maintain an overall higher mean balance throughout the rollup window.

In [ ]:
plot_agg_variables(client_df, 'withdrawal', ['mean'])
No description has been provided for this image

The comparison between withdrawal for card owners and non-card customers again shows a very similar trend/structure as already observed in the plot for the volume in the rollup window. The card owners' withdrawal decreases linearly while the non-card customers' withdrawal increases during the first 3 months of the rollup window and then starts to decrease as well.

When looking at the range of values for both groups the same difference can be seen again where card owners generally have a higher mean withdrawal in the months before their card issuance.

Train-Test-Split¶

Now that numerous features have been added and preprocessing is finished, the next step is to split the data into a training and a test set before training the proposed baseline and candidate models.

To split the data, a train_test_split_bal function was written. It randomly undersamples the train split to a 50/50 class balance. This ensures the models do not develop a bias towards the majority class and can instead concentrate on the features that predict card ownership.

The test split, however, is not undersampled: it keeps the original data's distribution, since that distribution can be assumed to represent the actual customer base. If the real-world data is imbalanced, the test set should be equally imbalanced so that the model's performance is evaluated under realistic conditions.

Later on during training, the models go through 5-Fold Cross Validation which will inherently introduce a validation set of 1/5th of the training set.

The split ratio was set to 80/20; 80% being the training set size and 20% the test size.
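A minimal sketch of what train_test_split_bal could look like internally (the actual implementation lives in src.train_utils; the version below, including its name, is an assumption for illustration only):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

def train_test_split_bal_sketch(df, target_column, test_size=0.2, random_state=1337):
    """Stratified train/test split; afterwards the majority class of the
    *training* portion is randomly undersampled to a 50/50 balance, while
    the test portion keeps the original class distribution."""
    X = df.drop(columns=target_column)
    y = df[target_column]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, stratify=y, random_state=random_state)

    # Undersample the training portion only: sample minority-class-many
    # rows from each class
    train = X_train.assign(**{target_column: y_train})
    minority_size = train[target_column].value_counts().min()
    balanced = (train.groupby(target_column, group_keys=False)
                     .apply(lambda g: g.sample(minority_size, random_state=random_state)))
    return (balanced.drop(columns=target_column), X_test,
            balanced[target_column], y_test)
```

The key point is that the undersampling happens after the split, so no information from the test set leaks into the balancing step.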

In [ ]:
from src.train_utils import train_test_split_bal

cleaned_client_df = client_df.loc[:, ~client_df.columns.str.contains('_id|id_|_date|card_issued|age_bracket|card_type')]

X_train, X_test, y_train, y_test = train_test_split_bal(df=cleaned_client_df, target_column='has_card', test_size=0.2, random_state=1337, balancing_technique='undersample')
In [ ]:
print('Train set:')
print(y_train.value_counts(normalize=True))
print('\nTest set:')
print(y_test.value_counts(normalize=True))
Train set:
has_card
False    0.5
True     0.5
Name: proportion, dtype: float64

Test set:
has_card
False    0.792732
True     0.207268
Name: proportion, dtype: float64

As intended, the class distribution is balanced on the train set but the test set's distribution stays unchanged and roughly represents the real world data.

Baseline Model¶

The baseline model serves as a point of comparison for the later model candidates.

The baseline model only models the has_card class based on the following features:

  • Client Age
  • Gender
  • Client's District Region
  • Volume before Card Issuance (Volume Rollup Window)
  • Balance before Card Issuance (Balance Rollup Window)

The data first goes through a preprocessing pipeline in which numeric columns are standard-scaled, categorical variables are one-hot encoded, and missing values are imputed with the column mean. This pipeline is constructed with the build_preprocessor_pipeline function, which accepts custom feature columns either as explicit column names or via a regex; the latter is especially useful for the monthly rollup window columns. The function is used not only for the baseline model but for all further model candidates as well.
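For illustration, a pipeline with this behaviour could be sketched with sklearn's ColumnTransformer (the real build_preprocessor_pipeline lives in src.train_utils; the function below, including its name and signature, is a hypothetical sketch):

```python
import re

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

def build_preprocessor_sketch(X, include_columns=None, regex_columns=None):
    """Select feature columns by name and/or regex, then mean-impute and
    standard-scale numeric columns and one-hot encode categorical ones."""
    cols = list(include_columns or [])
    if regex_columns is not None:
        cols += [c for c in X.columns if re.search(regex_columns, c) and c not in cols]
    selection = X[cols]
    numeric = selection.select_dtypes(include='number').columns.tolist()
    categorical = [c for c in cols if c not in numeric]
    preprocessor = ColumnTransformer([
        ('num', Pipeline([('impute', SimpleImputer(strategy='mean')),
                          ('scale', StandardScaler())]), numeric),
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical),
    ])
    return preprocessor, selection
```

Returning the column selection alongside the preprocessor keeps the fitted pipeline and the subset of columns it expects in sync, mirroring how the notebook passes both around.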

In [ ]:
from src.plot_utils import (plot_model_evaluation_summary, 
                            plot_confusion_matrix, 
                            plot_lift_curve, 
                            plot_precision_recall_curve, 
                            plot_metrics,
                            plot_param_grid_heatmap)

from src.train_utils import (cross_validate,
                             build_preprocessor_pipeline)

from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

baseline_columns = ['client_age', 'client_gender', 'client_district_region']

baseline_preprocessor, baseline_column_selection = build_preprocessor_pipeline(X_train, 
                                                                               include_columns=baseline_columns, 
                                                                               regex_columns='(volume|balance)')

base_pipeline = Pipeline([
    ('preprocessor', baseline_preprocessor),
    ('classifier', LogisticRegression(random_state=1337, max_iter=10_000))
])

(base_estimator, 
 base_params, 
 base_metrics_df, 
 base_roc_curves, 
 base_pr_curves, 
 base_lift_probs, 
 base_true_labels,
 _) = cross_validate(base_pipeline, 
                     baseline_column_selection, 
                     y_train)

model_name = base_estimator.named_steps['classifier'].__class__.__name__

plot_metrics(base_metrics_df, model_name)
Cross-Validation: 100%|██████████| 5/5 [00:00<00:00, 257.50it/s]
No description has been provided for this image

The simple baseline logistic regression model shows relatively consistent performance across the different metrics, with accuracy, precision, recall, and F1 score all at 71% with 3% standard deviation across the 5 folds. The ROC/AUC is slightly higher at 78%, indicating a reasonably good classification performance. The standard deviations indicate that there is some variability in these metrics, but they remain relatively stable.

In [ ]:
plot_model_evaluation_summary(base_roc_curves, base_metrics_df.mean(), base_estimator, X_test[baseline_column_selection.columns], y_test, model_name)
No description has been provided for this image

The baseline model demonstrates a good discriminative ability on the validation set with an ROC AUC of 0.78. This indicates that the model is effective at distinguishing between positive and negative cases.

The confusion matrix shows that the model has a balanced performance with:

  • Accuracy: A high number of true positives and true negatives contribute to the overall accuracy.
  • Precision and Recall: The model maintains consistent performance, as indicated by the balanced number of false positives and false negatives.

Further Model Candidates¶

Since we now have the baseline model set up, let's explore further candidate models that are not restricted in their variable selection.

All further candidate models are trained on all available variables. Ideally, feature selection would be performed for each model independently, but this is very compute-intensive, so every candidate is simply cross-validated on the full feature set. Additionally, a grid search is performed per model to find hyperparameters that improve performance.

After exploring numerous candidates, a comparison will be made to "crown" a final model. The final model will then go through a feature selection process using Recursive Feature Elimination (RFE) on the 100 most important variables according to dropout loss.

Logistic Regression¶

The first model candidate is also a Logistic Regression model but without any constraints on number of features.

In [ ]:
all_vars_prep_pipeline, all_vars_column_selection  = build_preprocessor_pipeline(X_train, include_columns=X_train.columns.tolist())

lr_pipeline = Pipeline([
    ('preprocessor', all_vars_prep_pipeline),
    ('classifier', LogisticRegression(random_state=1337, max_iter=100_000, solver='saga'))
])

lr_param_grid = {
    'classifier__C': [0.01, 0.1, 1, 10, 100],
    'classifier__penalty': ['l1', 'l2'],
}

(lr_estimator,
 lr_params,
 lr_metrics_df,
 lr_roc_curves,
 lr_pr_curves,
 lr_lift_probs,
 lr_true_labels,
 lr_grid_search) = cross_validate(lr_pipeline, all_vars_column_selection, y_train, param_grid=lr_param_grid)

model_name = lr_estimator.named_steps['classifier'].__class__.__name__

plot_metrics(lr_metrics_df, model_name)
Cross-Validation: 100%|██████████| 5/5 [00:00<00:00, 4794.59it/s]
No description has been provided for this image

The logistic regression model that includes all variables shows better performance compared to the baseline model. The improvements in accuracy, precision, recall, F1 score, and ROC AUC could be due to including all features.

In [ ]:
plot_model_evaluation_summary(lr_roc_curves, lr_metrics_df.mean(), lr_estimator, X_test[all_vars_column_selection.columns], y_test, model_name)
No description has been provided for this image

The logistic regression model with all variables yields a better discriminative ability on the validation set with an ROC AUC of 0.85. This is quite a large improvement compared to the baseline model.

The confusion matrix however doesn't show much of an improvement. The confusion matrix is calculated on the test set while the ROC curve is calculated on the validation set, hence we start to overfit on the validation set and don't really improve much more on the test set.

In [ ]:
plot_param_grid_heatmap(lr_grid_search)
No description has been provided for this image

The GridSearchCV identified the best hyperparameters for the logistic regression model based on the AUC values. The optimal combination found is a regularization parameter C of 1.0 with an L1 penalty, achieving the highest AUC score of 0.846.

Decision Tree¶

The decision tree also gets fit on all variables.

In [ ]:
from sklearn.tree import DecisionTreeClassifier

dt_pipeline = Pipeline([
    ('preprocessor', all_vars_prep_pipeline),
    ('classifier', DecisionTreeClassifier(random_state=1337))
])

dt_param_grid = {
    'classifier__max_depth': [5, 10, 20, 30, 50],
    'classifier__min_samples_split': [5, 10, 20, 30, 50]
}

(dt_estimator, 
 dt_params, 
 dt_metrics_df, 
 dt_roc_curves, 
 dt_pr_curves, 
 dt_lift_probs, 
 dt_true_labels,
 dt_grid_search) = cross_validate(pipeline=dt_pipeline, 
                                  X=all_vars_column_selection, 
                                  y=y_train,
                                  param_grid=dt_param_grid)

model_name = dt_estimator.named_steps['classifier'].__class__.__name__
plot_metrics(dt_metrics_df, model_name)
Cross-Validation: 100%|██████████| 5/5 [00:00<00:00, 5440.08it/s]
No description has been provided for this image

The Decision Tree Classifier performs consistently across different metrics, but overall with lower scores compared to both the baseline Logistic Regression and the extended Logistic Regression models.

In [ ]:
plot_model_evaluation_summary(dt_roc_curves, dt_metrics_df.mean(), dt_estimator, X_test[all_vars_column_selection.columns], y_test, model_name)
No description has been provided for this image

The confusion matrix reveals that the model correctly identifies a substantial number of positive and negative cases in the test set, but it still performs worse than the previously explored models: both the True Positives and True Negatives decreased while the False Positives and False Negatives increased slightly.

In [ ]:
plot_param_grid_heatmap(dt_grid_search)
No description has been provided for this image

The grid search results show that the Decision Tree performs better with shallower trees and stricter splitting criteria.

Random Forest¶

Since the Decision Tree turned out to be worse than the previous models, the Random Forest candidate will explore whether an ensemble of many decision trees can fit the data better.

In [ ]:
from sklearn.ensemble import RandomForestClassifier

all_vars_prep_pipeline, all_vars_column_selection = build_preprocessor_pipeline(X_train, include_columns=X_train.columns.tolist())

rf_pipeline = Pipeline([
    ('preprocessor', all_vars_prep_pipeline),
    ('classifier', RandomForestClassifier(random_state=1337))
])

rf_param_grid = {
    'classifier__n_estimators': [10, 50, 100, 200],
    'classifier__max_features': ['sqrt', 'log2']
}

(rf_estimator, 
 rf_params, 
 rf_metrics_df, 
 rf_roc_curves, 
 rf_pr_curves, 
 rf_lift_probs, 
 rf_true_labels,
 rf_grid_search) = cross_validate(rf_pipeline, 
                                  X_train[all_vars_column_selection.columns], 
                                  y_train, 
                                  param_grid=rf_param_grid)

model_name = rf_estimator.named_steps['classifier'].__class__.__name__

plot_metrics(rf_metrics_df, model_name)
Cross-Validation: 100%|██████████| 5/5 [00:00<00:00, 4960.15it/s]
No description has been provided for this image

As hypothesized, the Random Forest performs better than the Decision Tree; however, the Logistic Regression approach still cannot be beaten when comparing the metrics.

In [ ]:
plot_model_evaluation_summary(rf_roc_curves, rf_metrics_df.mean(), rf_estimator, X_test[all_vars_column_selection.columns], y_test, model_name)
No description has been provided for this image

The ROC curve shows fairly stable cross validation with just one fold performing significantly worse. The confusion matrix on the test set shows an interesting change: The False Negatives decreased compared to the Logistic Regression while the False Positives increased.

In [ ]:
plot_param_grid_heatmap(rf_grid_search)
No description has been provided for this image

The grid search results imply that the Random Forest benefits from a higher number of estimators, which makes sense since averaging over more trees reduces the variance term in the bias-variance trade-off. The max_features parameter only had a slight impact on the AUC score, with the square root of the number of features coming out as the "best" setting.

Bagging¶

In [ ]:
from sklearn.ensemble import BaggingClassifier

bg_pipeline = Pipeline([
    ('preprocessor', all_vars_prep_pipeline),
    ('classifier', BaggingClassifier(random_state=1337))
])

bg_param_grid = {
    'classifier__n_estimators': [10, 50, 100, 200],
    'classifier__max_samples': [0.5, 0.7, 1.0]
}

(bg_estimator, 
 bg_params, 
 bg_metrics_df, 
 bg_roc_curves, 
 bg_pr_curves, 
 bg_lift_probs, 
 bg_true_labels,
 bg_grid_search) = cross_validate(bg_pipeline,
                                  all_vars_column_selection, 
                                  y_train, 
                                  param_grid=bg_param_grid)

model_name = bg_estimator.named_steps['classifier'].__class__.__name__

plot_metrics(bg_metrics_df, model_name)
Cross-Validation: 100%|██████████| 5/5 [00:00<00:00, 5599.87it/s]
No description has been provided for this image

The Bagging Classifier achieves roughly the same metrics as the Logistic Regression model, though it still cannot reach the benchmark when comparing the AUC score.

In [ ]:
plot_model_evaluation_summary(bg_roc_curves, bg_metrics_df.mean(), bg_estimator, X_test[all_vars_column_selection.columns], y_test, model_name)
No description has been provided for this image

The same phenomenon already observed for the Random Forest appears here as well: the False Negatives decrease in comparison while the False Positives increase by a noticeable margin.

In [ ]:
plot_param_grid_heatmap(bg_grid_search)
No description has been provided for this image

The param grid visualization clearly shows that the Bagging Classifier benefits from added model complexity. The best AUC score was reached with 200 estimators and max_samples of 1.0, i.e. each estimator drawing on the full sample size.

Boosting¶

In [ ]:
from sklearn.ensemble import GradientBoostingClassifier

bt_pipeline = Pipeline([
    ('preprocessor', all_vars_prep_pipeline),
    ('classifier', GradientBoostingClassifier(random_state=1337))
])

bt_param_grid = {
    'classifier__n_estimators': [10, 50, 100, 200],
    'classifier__learning_rate': [0.01, 0.05, 0.1, 0.5]
}

(bt_estimator, 
 bt_params, 
 bt_metrics_df, 
 bt_roc_curves, 
 bt_pr_curves, 
 bt_lift_probs, 
 bt_true_labels,
 bt_grid_search) = cross_validate(bt_pipeline,
                                  all_vars_column_selection, 
                                  y_train, 
                                  param_grid=bt_param_grid)

model_name = bt_estimator.named_steps['classifier'].__class__.__name__

plot_metrics(bt_metrics_df, model_name)
Cross-Validation: 100%|██████████| 5/5 [00:00<00:00, 6082.23it/s]
No description has been provided for this image

The Gradient Boosting Classifier achieves slightly better scores than the Logistic Regression model on Accuracy, Precision, Recall, and F1. The AUC score is on par with the previous best model.

In [ ]:
plot_model_evaluation_summary(bt_roc_curves, bt_metrics_df.mean(), bt_estimator, X_test[all_vars_column_selection.columns], y_test, model_name)
No description has been provided for this image

The evaluation summary plot also shows the improvement, especially in the confusion matrix, which was calculated on the test set. Both the False Positives and False Negatives decreased compared to the other ensemble models, though compared to the extended Logistic Regression's results the performance on the test set still seems worse. This indicates that the Boosting Classifier is starting to overfit on the training and validation sets.

In [ ]:
plot_param_grid_heatmap(bt_grid_search)
No description has been provided for this image

The param grid plot shows that the model works best with a learning rate of around 0.1 and starts to overshoot the minimum at 0.5. In line with the observations on the previous ensemble models, the Boosting Classifier also benefits from more complexity, reaching its best AUC score with 200 estimators.

Comparing the Model Candidates¶

In this chapter the trained candidates are compared to each other so that a "best" model can be determined and further improved upon.

ROC Curves¶

The first comparison looks at true positive and false positive rates (ROC curves).

In [ ]:
from src.plot_utils import plot_multiple_roc_curves

plot_multiple_roc_curves([base_roc_curves, lr_roc_curves, dt_roc_curves, rf_roc_curves, bg_roc_curves, bt_roc_curves], 
                         ['Baseline Logistic Regression', 'Logistic Regression', 'Decision Tree', 'Random Forest', 'Bagging Classifier', 'Boosting Classifier'])
No description has been provided for this image

The Boosting Classifier reached the highest AUC score of 0.849 (84.9%) with the lowest standard deviation measured across the 5-fold Cross Validation.

The tree-based models show an interesting stepwise pattern in the ROC curve. This is likely because a decision tree can only emit a limited set of distinct scores (the class proportions of its leaves), so only a fixed set of thresholds actually changes the classification, resulting in a stepwise ROC curve rather than a smooth one.
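This effect can be illustrated on synthetic data by counting the distinct scores each model type can emit (a standalone sketch, not part of the notebook's pipeline):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, random_state=0)

tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X, y)
logreg = LogisticRegression(max_iter=1000).fit(X, y)

# A depth-4 tree has at most 2**4 = 16 leaves, so its predicted
# probabilities (leaf class proportions) take at most 16 distinct values,
# while logistic regression emits an essentially continuous score.
tree_scores = np.unique(tree.predict_proba(X)[:, 1])
lr_scores = np.unique(logreg.predict_proba(X)[:, 1])
print(len(tree_scores), len(lr_scores))
```

With only a handful of distinct scores, the ROC curve can only move in a few large steps, which is exactly the pattern visible above.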

In [ ]:
from src.plot_utils import plot_auc_boxplots

plot_auc_boxplots([base_roc_curves, lr_roc_curves, dt_roc_curves, rf_roc_curves, bg_roc_curves, bt_roc_curves], 
                  ['Baseline Logistic Regression', 'Logistic Regression', 'Decision Tree', 'Random Forest', 'Bagging Classifier', 'Boosting Classifier'])
No description has been provided for this image

Overall, looking at the distribution of AUC scores across the 5 folds, the extended Logistic Regression (with all variables) reached the highest individual fold scores, even compared to the Boosting Classifier.

Confusion Matrices¶

To compare the performance on the test set for the candidates, the following plot shows all confusion matrices.

In [ ]:
from src.plot_utils import plot_confusion_matrices

plot_confusion_matrices([base_estimator, lr_estimator, dt_estimator, rf_estimator, bg_estimator, bt_estimator], 
                        X_test[all_vars_column_selection.columns], 
                        y_test,
                        ['Baseline Logistic Regression', 'Logistic Regression', 'Decision Tree', 'Random Forest', 'Bagging Classifier', 'Boosting Classifier'])
No description has been provided for this image

The two best performing models across the board seem to be the Boosting Classifier and the Logistic Regression with all available features. One observation is that the Boosting Classifier leans slightly towards True Positives while the Logistic Regression leans slightly towards True Negatives.

Other Metrics¶

In [ ]:
average_metrics = []

for name, df in zip(['Baseline Logistic Regression', 'Logistic Regression', 'Decision Tree', 'Random Forest', 'Bagging Classifier', 'Boosting Classifier'], 
                    [base_metrics_df, lr_metrics_df, dt_metrics_df, rf_metrics_df, bg_metrics_df, bt_metrics_df]):
    mean_metrics = df.mean()
    mean_metrics['Estimator'] = name
    average_metrics.append(mean_metrics)

average_metrics_df = pd.DataFrame(average_metrics)[['Estimator', 'Accuracy', 'Precision', 'Recall', 'F1 Score', 'ROC AUC']]

average_metrics_df
Out[ ]:
Estimator Accuracy Precision Recall F1 Score ROC AUC
0 Baseline Logistic Regression 0.710797 0.711504 0.710797 0.710601 0.783406
1 Logistic Regression 0.768968 0.769728 0.768968 0.768752 0.846167
2 Decision Tree 0.714172 0.720616 0.714172 0.712326 0.772888
3 Random Forest 0.758845 0.763921 0.758845 0.757804 0.830956
4 Bagging Classifier 0.763068 0.765643 0.763068 0.762508 0.833745
5 Boosting Classifier 0.774045 0.777111 0.774045 0.773304 0.848636

Top Customer Lists¶

To see how far the model candidates agree on the most promising customers, the following subsection shows the overlap of the predicted top 5% and top 10% customer lists between all candidates.
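For illustration, such an overlap between two models' top-p% lists could be computed as follows (the actual plot uses plot_model_concordance from src.plot_utils; the helper below is a hypothetical sketch):

```python
import pandas as pd

def top_p_overlap(scores_a, scores_b, p=0.05):
    """Share of customers appearing in BOTH models' top-p% ranked lists,
    relative to the list size (1.0 = full agreement, 0.0 = none)."""
    scores_a, scores_b = pd.Series(scores_a), pd.Series(scores_b)
    k = max(1, int(len(scores_a) * p))
    top_a = set(scores_a.nlargest(k).index)
    top_b = set(scores_b.nlargest(k).index)
    return len(top_a & top_b) / k
```

In the notebook's setting, scores_a and scores_b would be two candidates' predicted card probabilities for the non-card customers, computed pairwise for all candidates.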

In [ ]:
from src.plot_utils import plot_model_concordance

estimators = [base_estimator, lr_estimator, dt_estimator, rf_estimator, bg_estimator, bt_estimator]
estimator_names = ['Baseline Logistic Regression', 'Logistic Regression', 'Decision Tree', 'Random Forest', 'Bagging Classifier', 'Boosting Classifier']
non_card_holders = cleaned_client_df[cleaned_client_df['has_card'] == False].drop(columns='has_card')

plot_model_concordance(estimators, estimator_names, non_card_holders, p=0.05)
No description has been provided for this image

For the top 5% of customers, there is some agreement between the baseline Logistic Regression and the extended Logistic Regression models. Another area with more agreement is among the ensemble models (Boosting, Bagging and Random Forest classifiers). The highest overlap is between the Random Forest and the Bagging Classifier.

In [ ]:
plot_model_concordance(estimators, estimator_names, non_card_holders, p=0.1)
No description has been provided for this image

Adding another 5% on top of the smaller subset raises the agreement between almost all models; only the overlap between the extended and baseline Logistic Regression stays at roughly the same percentage. The ensemble models in particular reach a high overlap. I suspect this is due to their architecture: ensemble methods like Bagging and Boosting combine many base learners' predictions, and the Bagging Classifier and Random Forest even share the same bagging technique, which could explain their especially high overlap.

Choosing a Final Model¶

To crown a final "winner" model, I suggest choosing the extended Logistic Regression. The natural data distribution is heavily skewed towards customers without cards, and the Logistic Regression classifies that majority class best; with a view to lowering the cost of advertising to the wrong customers, its confusion matrix minimizes that potential advertisement cost.

Model Reduction¶

With the Logistic Regression model chosen, it will first be reduced to improve explainability and interpretability. A model's performance can also suffer under high feature complexity, so this step tries to minimize the number of required features.

Selecting Features¶

Since Recursive Feature Elimination itself can be computationally very heavy, only the top 50 features (according to dalex's dropout loss) will be reduced by RFE.

The dropout loss essentially measures the change in the model's loss when the respective variable is "dropped out", i.e. left out of the fit. The more meaningful a variable is for a model's predictions, the higher its dropout loss. Relying on variable importance (i.e. dropout loss) alone is not inherently the best approach to feature selection, since it neglects the potential importance of interacting variables. This is where RFE comes in: RFE also puts emphasis on the interactions between variables when selecting "important" features.
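The two-stage idea can be sketched with sklearn alone, using permutation importance as a stand-in for dalex's dropout loss (a standalone illustration on synthetic data, not the notebook's actual code):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                           random_state=0)
model = LogisticRegression(max_iter=1000).fit(X, y)

# Stage 1: rank variables by permutation importance (how much the AUC
# drops when a variable is shuffled) and keep the top 10
perm = permutation_importance(model, X, y, scoring='roc_auc', random_state=0)
top = np.argsort(perm.importances_mean)[::-1][:10]

# Stage 2: run RFE only on the pre-selected subset, which also takes
# interactions between the remaining variables into account
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=5).fit(X[:, top], y)
selected = top[rfe.support_]
```

Pre-filtering by importance keeps the expensive RFE loop small, at the price of possibly discarding a variable that would only matter in interaction with others.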

In [ ]:
from dalex import Explainer

lr_explainer = Explainer(lr_estimator, X_train[all_vars_column_selection.columns], y_train, label='Logistic Regression')

lr_model_parts = lr_explainer.model_parts(type='variable_importance')

lr_top_variables = lr_model_parts.result.sort_values('dropout_loss', ascending=False).head(51)[['variable', 'dropout_loss']]
lr_top_variables = lr_top_variables[lr_top_variables['variable'] != '_baseline_']

lr_top_variables
Preparation of a new explainer is initiated

  -> data              : 1186 rows 233 cols
  -> target variable   : Parameter 'y' was a pandas.Series. Converted to a numpy.ndarray.
  -> target variable   : 1186 values
  -> model_class       : sklearn.linear_model._logistic.LogisticRegression (default)
  -> label             : Logistic Regression
  -> predict function  : <function yhat_proba_default at 0x29c922ca0> will be used (default)
  -> predict function  : Accepts only pandas.DataFrame, numpy.ndarray causes problems.
  -> predicted values  : min = 0.000127, mean = 0.5, max = 0.999
  -> model type        : classification will be used (default)
  -> residual function : difference between y and yhat (default)
  -> residuals         : min = -0.98, mean = -3.87e-05, max = 0.868
  -> model_info        : package sklearn

A new explainer has been created!
Out[ ]:
variable dropout_loss
233 pre_card_credit_mean 0.303514
232 credit_month_diff_7 0.280006
231 credit_month_diff_9 0.218071
230 credit_month_diff_6 0.197614
229 credit_month_mean 0.193826
228 pre_card_n_transactions_sum 0.187998
227 credit_month_sum 0.187252
226 credit_month_max 0.166707
225 volume_month_diff_7 0.164795
224 volume_month_diff_9 0.151570
223 balance_month_std 0.144832
222 volume_month_diff_6 0.141864
221 pre_card_n_transactions_mean 0.140690
220 credit_last_quarter_std 0.140570
219 pre_card_volume_std 0.139945
218 volume_last_quarter_max 0.132605
217 n_transactions_month_min 0.132325
216 volume_month_std 0.122966
215 credit_month_diff_8 0.120589
214 volume_month_mean 0.119929
213 volume_month_sum 0.119134
212 credit_month_diff_5 0.114507
211 credit_month_diff_4 0.109794
210 credit_month_median 0.109789
209 n_transactions_month_max 0.109150
208 account_age 0.108561
207 credit_month_diff_3 0.108317
206 volume_month_median 0.107461
205 volume_first_quarter_median 0.105405
204 pre_card_credit_sum 0.105006
203 withdrawal_month_diff_8 0.103334
202 credit_month_diff_11 0.094544
201 volume_last_quarter_std 0.092954
200 withdrawal_month_diff_5 0.092878
199 volume_month_diff_5 0.092126
198 pre_card_volume_min 0.089352
197 n_transactions_month_std 0.088932
196 volume_month_diff_11 0.087723
195 withdrawal_month_diff_9 0.086668
194 credit_first_quarter_median 0.086437
193 credit_month_std 0.086296
192 withdrawal_month_max 0.085812
191 pre_card_n_transactions_median 0.084848
190 volume_month_diff_4 0.084675
189 n_transactions_month_diff_7 0.084351
188 withdrawal_month_diff_1 0.083861
187 pre_card_n_transactions_min 0.081967
186 withdrawal_month_min 0.081871
185 n_transactions_month_diff_13 0.080430
184 n_transactions_month_diff_5 0.079018

With the top 50 features selected, a new preprocessor pipeline is constructed that applies the transformations to exactly those variables.

In [ ]:
lr_top_vars_prep_pipeline, lr_top_vars_prep_selection = build_preprocessor_pipeline(X_train, include_columns=X_train[lr_top_variables['variable']].columns.tolist())

lr_rfe_pipeline = Pipeline([
    ('preprocessor', lr_top_vars_prep_pipeline),
    ('classifier', LogisticRegression(**lr_estimator.named_steps['classifier'].get_params()))
])

(lr_rfe_estimator,
 lr_rfe_params,
 lr_rfe_metrics_df,
 lr_rfe_roc_curves,
 lr_rfe_pr_curves,
 lr_rfe_lift_probs,
 lr_rfe_true_labels,
 _) = cross_validate(lr_rfe_pipeline, lr_top_vars_prep_selection, y_train, n_features_to_select=20)
Cross-Validation: 100%|██████████| 5/5 [00:00<00:00, 2018.63it/s]

The RFE-selected features can now be accessed through the RFE pipeline:

In [ ]:
rfe_step = lr_rfe_estimator.named_steps['rfe']
column_ids = rfe_step.get_support()

rfe_selected_columns = lr_rfe_estimator.named_steps['preprocessor'].get_feature_names_out()[column_ids]
rfe_selected_columns = [col.split('__')[-1] for col in rfe_selected_columns]

pd.DataFrame(rfe_selected_columns, columns=['Selected Features'])
Out[ ]:
Selected Features
0 pre_card_credit_mean
1 credit_month_diff_7
2 credit_month_diff_9
3 credit_month_mean
4 pre_card_n_transactions_sum
5 credit_month_sum
6 volume_month_diff_7
7 volume_month_diff_9
8 credit_last_quarter_std
9 pre_card_volume_std
10 n_transactions_month_min
11 volume_month_mean
12 volume_month_sum
13 account_age
14 volume_last_quarter_std
15 pre_card_volume_min
16 withdrawal_month_diff_9
17 pre_card_n_transactions_median
18 pre_card_n_transactions_min
19 n_transactions_month_diff_13

The credit and volume statistics computed over the entire rollup window are the most strongly represented among the RFE-selected features. Customer behavior in the 13 months before issuance, or at least some statistical summaries of it, therefore appears to be a favorable marker for the model.
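The elimination loop RFE performs can be sketched in isolation on synthetic data (all names here are illustrative, not part of the notebook's pipeline): the estimator is refit repeatedly, dropping the weakest coefficients until the requested number of features remains.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic data: 20 candidate features, 5 of them informative.
X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                           random_state=42)

# Eliminate one feature per iteration until 5 survive.
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=5, step=1)
rfe.fit(X, y)

selected = [i for i, keep in enumerate(rfe.support_) if keep]
print(selected)  # column indices of the surviving features
```

`rfe.support_` is the same boolean mask accessed via `get_support()` in the cell above, and `rfe.ranking_` would give the order in which features were eliminated.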

In [ ]:
plot_metrics(lr_rfe_metrics_df, 'Logistic Regression with RFE')
In [ ]:
plot_model_evaluation_summary(lr_rfe_roc_curves, lr_rfe_metrics_df.mean(), lr_rfe_estimator, X_test, y_test, 'Logistic Regression with RFE')

The logistic regression model with RFE (20 selected features) performs comparably to the model with all variables, with a minor trade-off in the ROC AUC score (0.84 vs. 0.85). The RFE model shows a slight improvement in identifying positive cases (higher TP, lower FN) but a marginal increase in false positives. Overall, reducing the feature set to 20 via RFE yields similar performance while offering benefits in model simplicity and computational efficiency, so the trade-off is barely noticeable.

In [ ]:
plot_lift_curve(lr_rfe_lift_probs, lr_rfe_true_labels, 'Logistic Regression with RFE')

The lift curve shows that, within the top 10% of scored clients, the proposed model identifies about twice as many future cardholders as random selection would. At 50% of the customer base, the model still achieves a lift just above 1.5.
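The quantity behind this curve can be reproduced conceptually: lift at depth p is the positive rate among the top-p scored customers divided by the overall positive rate. A minimal sketch on synthetic scores (names and data are illustrative):

```python
import numpy as np

def lift_at(y_true, y_score, depth):
    """Lift within the top `depth` fraction of customers, ranked by score."""
    order = np.argsort(y_score)[::-1]          # highest scores first
    n_top = max(1, int(len(y_true) * depth))
    top_rate = y_true[order][:n_top].mean()    # positive rate in the top slice
    base_rate = y_true.mean()                  # positive rate overall
    return top_rate / base_rate

rng = np.random.default_rng(1)
y_true = rng.integers(0, 2, size=1000)
# A score correlated with the label, so lift exceeds 1 at small depths.
y_score = y_true + rng.normal(0, 0.8, size=1000)

print(round(lift_at(y_true, y_score, 0.10), 2))
print(round(lift_at(y_true, y_score, 0.50), 2))
```

At depth 1.0 the top slice is the whole population, so the lift converges to 1 by construction.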

Model Explanation¶

The last step is to get a better grasp on how the model makes decisions and ultimately predictions. To dive deeper into explainability, the package dalex will be utilized.

First, an Explainer is created for both the train split and the test split. As argued in the paper Do Not Explain Without Context: Addressing the Blind Spot of Model Explanations, it makes sense to look at all splits, since Partial Dependence Plots, for example, convey information not only about the model but also about the underlying data it was trained or evaluated on.

Since dalex is unfortunately still under active development for Python, some bugs emerge. For example, when working with an sklearn Pipeline, dalex has difficulties transforming the data and displaying the correct feature names in its plots. Hence the process below first shows a rather laborious way of pushing the selected features through all pipeline steps and then passing the bare estimator and the transformed data to the Explainer. Because dalex expects plain sklearn estimators (not wrapped in a Pipeline), a warning was thrown repeatedly for every feature; warnings are therefore suppressed at this point to avoid cluttering the notebook.

In [ ]:
warnings.filterwarnings("ignore")

X_train_rfe_prep = lr_rfe_estimator.named_steps['preprocessor'].transform(X_train[all_vars_column_selection.columns])
X_train_rfe_selected_cols = lr_rfe_estimator.named_steps['rfe'].transform(X_train_rfe_prep)
X_train_rfe_selected_cols = pd.DataFrame(X_train_rfe_selected_cols, columns=rfe_selected_columns)

X_test_rfe_prep = lr_rfe_estimator.named_steps['preprocessor'].transform(X_test[all_vars_column_selection.columns])
X_test_rfe_selected_cols = lr_rfe_estimator.named_steps['rfe'].transform(X_test_rfe_prep)
X_test_rfe_selected_cols = pd.DataFrame(X_test_rfe_selected_cols, columns=rfe_selected_columns)

test_explainer = Explainer(lr_rfe_estimator.named_steps['classifier'], X_test_rfe_selected_cols, y_test, label='Reduced Logistic Regression (test set)')
train_explainer = Explainer(lr_rfe_estimator.named_steps['classifier'], X_train_rfe_selected_cols, y_train, label='Reduced Logistic Regression (train set)')
Preparation of a new explainer is initiated

  -> data              : 743 rows 20 cols
  -> target variable   : Parameter 'y' was a pandas.Series. Converted to a numpy.ndarray.
  -> target variable   : 743 values
  -> model_class       : sklearn.linear_model._logistic.LogisticRegression (default)
  -> label             : Reduced Logistic Regression (test set)
  -> predict function  : <function yhat_proba_default at 0x29c922ca0> will be used (default)
  -> predict function  : Accepts pandas.DataFrame and numpy.ndarray.
  -> predicted values  : min = 0.00139, mean = 0.393, max = 0.999
  -> model type        : classification will be used (default)
  -> residual function : difference between y and yhat (default)
  -> residuals         : min = -0.992, mean = -0.186, max = 0.969
  -> model_info        : package sklearn

A new explainer has been created!
Preparation of a new explainer is initiated

  -> data              : 1186 rows 20 cols
  -> target variable   : Parameter 'y' was a pandas.Series. Converted to a numpy.ndarray.
  -> target variable   : 1186 values
  -> model_class       : sklearn.linear_model._logistic.LogisticRegression (default)
  -> label             : Reduced Logistic Regression (train set)
  -> predict function  : <function yhat_proba_default at 0x29c922ca0> will be used (default)
  -> predict function  : Accepts pandas.DataFrame and numpy.ndarray.
  -> predicted values  : min = 0.00252, mean = 0.5, max = 0.999
  -> model type        : classification will be used (default)
  -> residual function : difference between y and yhat (default)
  -> residuals         : min = -0.98, mean = -1.92e-07, max = 0.902
  -> model_info        : package sklearn

A new explainer has been created!

Evaluation Metrics¶

To get a first glance at the model's performance metrics, the following block displays the result of model_performance().

In [ ]:
model_performance = pd.concat([train_explainer.model_performance().result, test_explainer.model_performance().result])
model_performance
Out[ ]:
recall precision f1 accuracy auc
Reduced Logistic Regression (train set) 0.806071 0.784893 0.795341 0.792580 0.874579
Reduced Logistic Regression (test set) 0.798701 0.460674 0.584323 0.764468 0.854001

The performance metrics on the train set were already seen when the model was trained. The test set, however, now shows a much lower precision and therefore also a lower F1 score. This is likely because the test set was left untouched in terms of class distribution: with positives being rarer there, false positives weigh much more heavily on precision.
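This effect can be made concrete: with recall (TPR) and the false-positive rate held fixed, precision is purely a function of the class prevalence. The TPR/FPR values below are illustrative (chosen close to the observed metrics), not read from the fitted model.

```python
def precision(tpr, fpr, pi):
    """Precision implied by a fixed TPR/FPR at positive-class prevalence pi."""
    return tpr * pi / (tpr * pi + fpr * (1 - pi))

tpr, fpr = 0.80, 0.22
print(round(precision(tpr, fpr, 0.50), 3))  # balanced data  -> 0.784
print(round(precision(tpr, fpr, 0.15), 3))  # rare positives -> 0.391
```

The same classifier thus appears far less precise on an imbalanced test set even though its discrimination (TPR/FPR, and hence AUC) is unchanged.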

Variable Importance¶

Now we take a look at the variable importance (dropout loss) for the remaining top variables.

In [ ]:
mp_lg_train = train_explainer.model_parts(type='variable_importance')
mp_lg_test = test_explainer.model_parts(type='variable_importance')

mp_lg_train.plot(mp_lg_test)

For both sets, the first eight variables show roughly the same dropout loss. Beyond these, the variable importance differs between the two sets:

  • Within the train set, the number of transactions before card issuance and the account age play a bigger role than in the test partition.
  • Within the test partition, the number of transactions in the first month of the rollup window and the withdrawals in the 9th month before issuance play a larger role.

Partial Dependence¶

Each of the following subplots shows the partial dependence for both the train set (green line) and the test set (blue line), illustrating how consistent the reduced Logistic Regression model's learned relationships are across both datasets. A PDP shows how changing one variable, while keeping all others constant, influences the predicted outcome on average.
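The ceteris-paribus computation behind such a plot can be sketched directly (synthetic data; a conceptual sketch, not the dalex implementation): vary one feature over a grid while keeping every other column at its observed values, then average the predicted probability at each grid point.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=400, n_features=4, random_state=0)
model = LogisticRegression().fit(X, y)

def partial_dependence(model, X, feature, grid):
    profile = []
    for value in grid:
        X_mod = X.copy()
        X_mod[:, feature] = value              # ceteris paribus: fix one feature
        profile.append(model.predict_proba(X_mod)[:, 1].mean())
    return np.array(profile)

grid = np.linspace(X[:, 0].min(), X[:, 0].max(), 20)
pdp = partial_dependence(model, X, feature=0, grid=grid)
print(np.round(pdp[:5], 3))
```

For a logistic regression each individual profile is a sigmoid in the varied feature, so the averaged PDP is monotone; this is why most of the plots above show near-linear or compressed sigmoid shapes.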

In [ ]:
pd_lr_train = train_explainer.model_profile(center='mean')
pd_lr_test = test_explainer.model_profile(center='mean')

pd_lr_train.plot(pd_lr_test)
Calculating ceteris paribus: 100%|██████████| 20/20 [00:00<00:00, 79.78it/s]
Calculating ceteris paribus: 100%|██████████| 20/20 [00:00<00:00, 171.77it/s]

Within both the train and test partition, the partial dependence has a very similar structure for all features. Most variables show a linear trend; however, credit_month_mean (mean credit amount during the rollup window) and credit_month_sum (sum of credit amounts during the rollup window) show a more compressed sigmoid shape, where high credit statistics push the prediction towards 1 (has_card=True). The volume-related statistics (except for pre_card_volume_min), on the other hand, show the opposite: the higher the volume statistics, the less likely the model is to predict that a customer holds a card.

Variable Attribution¶

In [ ]:
account_ids = [14, 18]

selected_clients = client_df[client_df['account_id'].isin(account_ids)]

for account_id in account_ids:
    client = selected_clients[selected_clients['account_id'] == account_id]
    
    va_expl = train_explainer.predict_parts(client[rfe_selected_columns], type='shap', label=f'Account {account_id}')
    
    va_expl.plot()

Comparing the SHAP plots for Account 14 and Account 18, we can observe distinct differences in the feature contributions. For Account 14, the largest negative contributor is volume_month_sum (-0.797), while for Account 18, it is also volume_month_sum (-0.52), but with a slightly lesser impact. Both accounts see significant positive contributions from volume_month_diff_7 (+0.297 for Account 14) and volume_month_mean (+0.32 for Account 18). Additionally, Account 18's positive contributions are more distributed, with credit_month_sum and volume_last_quarter_std also playing significant roles, while Account 14 has fewer strongly positive features. Overall, Account 18 has a more balanced distribution of feature impacts compared to Account 14.
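The additivity property that makes these SHAP plots readable can be verified in closed form for a linear model (synthetic data; for the logistic model above this holds in log-odds space rather than directly on probabilities): with independent features, the attribution of feature j for an observation x is coef_j * (x_j - mean_j), and the attributions sum to the prediction minus the average prediction.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(0, 0.1, 200)

model = LinearRegression().fit(X, y)
x = X[0]

# Closed-form SHAP values for a linear model with independent features.
shap_values = model.coef_ * (x - X.mean(axis=0))

# Attributions plus the average prediction reconstruct the prediction for x.
reconstruction = model.predict(X).mean() + shap_values.sum()
print(np.round(shap_values, 3))
print(np.isclose(reconstruction, model.predict(x[None, :])[0]))
```

This is why the bars in each SHAP plot stack up to the gap between the account's predicted score and the dataset-average score.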

Conclusive Value of the Final Model¶

The final card purchase prediction model shows meaningful improvement for targeting potential buyers. The results show that for the top 20% of the customer sample, the success rate can almost be doubled compared to a 50/50 baseline. By focusing on the key factors like transaction volume and credit amounts, more precise market strategies can be launched.

In short, the new model shows appropriate robustness that will improve the bank's ability to connect the right customers with the right card offers.